[
https://issues.apache.org/jira/browse/TIKA-1483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14337862#comment-14337862
]
Hudson commented on TIKA-1483:
------------------------------
SUCCESS: Integrated in tika-trunk-jdk1.7 #509 (See
[https://builds.apache.org/job/tika-trunk-jdk1.7/509/])
Fix for TIKA-1483 Create a Latin1 charset raw string parser contributed by Luis
Filipe Nassif. (mattmann:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1662350)
* /tika/trunk/tika-parsers/src/main/java/org/apache/tika/parser/strings/Latin1StringsParser.java
* /tika/trunk/tika-parsers/src/test/java/org/apache/tika/parser/strings/Latin1StringsParserTest.java
Fix for TIKA-1483 Create a Latin1 charset raw string parser contributed by Luis
Filipe Nassif. (mattmann:
http://svn.apache.org/viewvc/tika/trunk/?view=rev&rev=1662349)
* /tika/trunk/CHANGES.txt
> Create a Latin1 charset raw string parser
> -----------------------------------------
>
> Key: TIKA-1483
> URL: https://issues.apache.org/jira/browse/TIKA-1483
> Project: Tika
> Issue Type: New Feature
> Components: parser
> Affects Versions: 1.6
> Reporter: Luis Filipe Nassif
> Assignee: Chris A. Mattmann
> Fix For: 1.8
>
> Attachments: TIKA-1483.patch, TIKA-1483_v2.patch
>
>
> I think it would be very useful to add a general parser able to extract raw
> strings from files (like the strings command), which could be used as the
> fallback parser for all mimetypes that have no specific parser
> implementation, such as application/octet-stream. It could also be used as a
> fallback for corrupt files that throw a TikaException.
> It must be configured with the script/language to be extracted from the files
> (currently I have implemented one specific to Latin1).
> It can use heuristics to extract strings encoded with different charsets
> within the same file, mainly the common ISO-8859-1, UTF-8 and UTF-16.
> What does the community think about that?
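
The raw string extraction described above (the approach of the strings
command) can be sketched roughly as follows. This is only an illustration and
not the code of Latin1StringsParser: the class name Latin1StringsSketch, its
methods, the byte ranges treated as printable, and the minimum run length are
all assumptions made for this example. It handles single-byte ISO-8859-1 only
and does not attempt the UTF-8/UTF-16 heuristics mentioned in the description.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Apache Tika.
class Latin1StringsSketch {

    // Assumption: a byte is "printable" if it is a tab, printable ASCII
    // (0x20-0x7E), or in the upper Latin-1 range (0xA0-0xFF).
    static boolean isPrintable(int b) {
        return b == 0x09 || (b >= 0x20 && b <= 0x7E) || (b >= 0xA0 && b <= 0xFF);
    }

    // Collects every run of at least minLength consecutive printable bytes,
    // decoded as ISO-8859-1.
    static List<String> extract(byte[] data, int minLength) {
        List<String> found = new ArrayList<>();
        int start = -1; // index where the current run began, or -1 if none
        // Iterate one step past the end so a run touching the end is flushed.
        for (int i = 0; i <= data.length; i++) {
            boolean printable = i < data.length && isPrintable(data[i] & 0xFF);
            if (printable) {
                if (start < 0) {
                    start = i;
                }
            } else if (start >= 0) {
                if (i - start >= minLength) {
                    found.add(new String(data, start, i - start,
                            StandardCharsets.ISO_8859_1));
                }
                start = -1;
            }
        }
        return found;
    }
}
```

Calling Latin1StringsSketch.extract(bytes, 4) on the raw bytes of a file keeps
only runs of four or more printable bytes, which is the default threshold of
the strings command; shorter runs are discarded as likely noise.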
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)