[ https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336107#comment-14336107 ]

Chris A. Mattmann commented on TIKA-1483:
-----------------------------------------

[~lfcnassif] I tried to apply this patch and am getting this:

{noformat}
[chipotle:~/tmp/tika] mattmann% patch -p1 < TIKA-1483_v2.patch
patching file tika-parsers/src/main/java/org/apache/tika/parser/strings/Latin1StringsParser.java
patching file tika-parsers/src/test/java/org/apache/tika/parser/strings/Latin1StringsParserTest.java
patch unexpectedly ends in middle of line
patch: **** malformed patch at line 403:
[chipotle:~/tmp/tika] mattmann%
{noformat}

Any ideas?

> Create a general raw string parser
> ----------------------------------
>
>                 Key: TIKA-1483
>                 URL: https://issues.apache.org/jira/browse/TIKA-1483
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Luis Filipe Nassif
>         Attachments: TIKA-1483.patch, TIKA-1483_v2.patch
>
>
> I think it would be very useful to add a general parser able to extract raw
> strings from files (like the strings command), which could be used as the
> fallback parser for all MIME types that have no specific parser
> implementation, such as application/octet-stream. It could also be used as a
> fallback for corrupt files that throw a TikaException.
> It must be configured with the script/language to be extracted from the files
> (currently I have implemented one specific to Latin1).
> It can use heuristics to extract strings encoded with different charsets
> within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
> What does the community think about that?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)