Luis Filipe Nassif created TIKA-1483:
----------------------------------------
Summary: Create a general raw string parser
Key: TIKA-1483
URL: https://issues.apache.org/jira/browse/TIKA-1483
Project: Tika
Issue Type: New Feature
Components: parser
Affects Versions: 1.6
Reporter: Luis Filipe Nassif
I think it can be very useful adding a general parser able to extract raw
strings from files (like the strings command), which can be used as the
fallback parser for all mimetypes not having a specific parser implementation,
like application/octet-stream. It can also be used as a fallback for corrupt
files throwing a TikaException.
It must be configured with the script/language to be extracted from the files
(currently I implemented one specific for Latin1).
It can use heuristics to extract strings encoded with different charsets within
the same file, maily ISO-8859-1, UTF8 and UTF16.
What the community think about that?
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)