[ 
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Luis Filipe Nassif updated TIKA-1483:
-------------------------------------
    Attachment: TIKA-1483_v2.patch

Hi [~gostep], thank you for the review.

I removed the carriage returns (part of my Eclipse project was using Windows 
line endings).

I think the a/ and b/ path prefixes are Eclipse/Git defaults. You can apply the 
patch with patch -p1, which strips them.

I renamed the class to Latin1StringsParser to reflect its goal, as it is much 
less general than proposed in the ticket. I also adjusted some comments and 
changed the unit test to use Unicode character escapes within the strings.

Should the minSize parameter be configured with a configuration object instead 
of a setter method? I think that is a bit heavyweight for only one parameter.
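
To make the trade-off concrete, here is a minimal sketch of the two options. 
The names MinSizeOptions, SetterStyleParser, StringsConfig and 
ConfigStyleParser are hypothetical and are not the classes in the patch; only 
the minSize parameter comes from the discussion, and the default of 4 is an 
assumption made for this sketch.

```java
// Hypothetical illustration only; these are not the classes from the patch.
public class MinSizeOptions {

    // Option 1: a plain setter on the parser itself.
    static class SetterStyleParser {
        private int minSize = 4; // assumed default, chosen only for this sketch

        void setMinSize(int minSize) {
            if (minSize < 1) {
                throw new IllegalArgumentException("minSize must be >= 1");
            }
            this.minSize = minSize;
        }

        int getMinSize() {
            return minSize;
        }
    }

    // Option 2: an immutable configuration object handed to the parser.
    static class StringsConfig {
        private final int minSize;

        StringsConfig(int minSize) {
            if (minSize < 1) {
                throw new IllegalArgumentException("minSize must be >= 1");
            }
            this.minSize = minSize;
        }

        int getMinSize() {
            return minSize;
        }
    }

    static class ConfigStyleParser {
        private final StringsConfig config;

        ConfigStyleParser(StringsConfig config) {
            this.config = config;
        }

        int getMinSize() {
            return config.getMinSize();
        }
    }

    public static void main(String[] args) {
        SetterStyleParser a = new SetterStyleParser();
        a.setMinSize(6);

        ConfigStyleParser b = new ConfigStyleParser(new StringsConfig(6));

        System.out.println(a.getMinSize() + " " + b.getMinSize());
    }
}
```

With a single parameter the second option mostly adds an extra class; it would 
start to pay off only if more options were added later.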

> Create a general raw string parser
> ----------------------------------
>
>                 Key: TIKA-1483
>                 URL: https://issues.apache.org/jira/browse/TIKA-1483
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Luis Filipe Nassif
>         Attachments: TIKA-1483.patch, TIKA-1483_v2.patch
>
>
> I think it would be very useful to add a general parser able to extract raw 
> strings from files (like the strings command), which could be used as the 
> fallback parser for all MIME types without a specific parser 
> implementation, such as application/octet-stream. It could also be used as a 
> fallback for corrupt files that throw a TikaException.
> It must be configured with the script/language to be extracted from the files 
> (currently I have implemented one specific to Latin1).
> It can use heuristics to extract strings encoded with different charsets 
> within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
> What does the community think about that?
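
The strings-like extraction described in the issue can be sketched roughly as 
follows. This is only an illustration under stated assumptions, not the code 
in the attached patch: the class name Latin1StringsSketch, the methods 
isTextByte and extract, and the choice of which bytes count as text are all 
hypothetical. It handles single-byte ISO-8859-1 runs only and does not attempt 
the UTF-8/UTF-16 heuristics mentioned above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the Latin1StringsParser from the patch.
public class Latin1StringsSketch {

    // Assumption: tab, printable ASCII and the ISO-8859-1 range 0xA0-0xFF
    // count as text; everything else ends the current run.
    static boolean isTextByte(int b) {
        return b == '\t' || (b >= 0x20 && b <= 0x7E) || (b >= 0xA0 && b <= 0xFF);
    }

    // Returns every run of text bytes that is at least minSize bytes long.
    static List<String> extract(byte[] data, int minSize) {
        List<String> result = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if none
        for (int i = 0; i <= data.length; i++) {
            boolean text = i < data.length && isTextByte(data[i] & 0xFF);
            if (text) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minSize) {
                    result.add(new String(data, start, i - start,
                            StandardCharsets.ISO_8859_1));
                }
                start = -1;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Binary noise around "Tika", "ok" (too short) and "caf\u00e9".
        byte[] data = {0x00, 0x01, 'T', 'i', 'k', 'a', 0x00, 'o', 'k', 0x02,
                'c', 'a', 'f', (byte) 0xE9, 0x00};
        for (String s : extract(data, 4)) {
            System.out.println(s);
        }
    }
}
```

With minSize set to 4, the two-byte run "ok" is dropped while the two longer 
runs are kept, which is the role the minSize parameter plays in filtering out 
noise.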



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
