Sébastien Nussbaumer created TIKA-2874:
------------------------------------------
Summary: Parsing of 4 mb excel file generates 163 mb worth of words
Key: TIKA-2874
URL: https://issues.apache.org/jira/browse/TIKA-2874
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 1.20
Reporter: Sébastien Nussbaumer
Attachments: tika-config.xml
When I parse the attached 4 MB Excel file, I get 163 MB worth of words. Looking
at the output, I see that some cells are repeated *many hundreds of thousands*
of times.
I tried passing the words through the uniq Linux command-line utility and got a
much more reasonable 16 KB file.
This is the code I use:
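For reference, what uniq does to the output can be sketched in Java as below. This is only an illustration: the class name, the method name, and the sample values are hypothetical and are not part of Tika or of the attached file. Note that uniq collapses only adjacent identical lines, so the drop to 16 KB suggests that the repeated cell values appear in consecutive runs.
{code:java}
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika: mimics the default behavior of `uniq`.
public class UniqSketch {

    /** Collapses each run of identical adjacent lines into a single line. */
    public static List<String> collapseAdjacentDuplicates(List<String> lines) {
        List<String> result = new ArrayList<>();
        String previous = null;
        for (String line : lines) {
            // Keep a line only when it differs from the line just before it.
            if (!line.equals(previous)) {
                result.add(line);
            }
            previous = line;
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample standing in for the extracted words, one per line.
        List<String> extracted = List.of("Total", "Total", "Total", "Q1", "Q1", "Total");
        List<String> collapsed = collapseAdjacentDuplicates(extracted);
        System.out.println(extracted.size() + " lines before, " + collapsed.size() + " after");
    }
}
{code}
Comparing the line count before and after gives a rough measure of how much of the output is repetition.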
{code:java}
TikaConfig config = new TikaConfig(new ClassPathResource("tika-config.xml").getURL());
Detector detector = config.getDetector();
Parser autoDetectParser = new AutoDetectParser(config);
Tika tika = new Tika(detector, autoDetectParser);

try (LanguageWriter languageWriter =
         new LanguageWriter(LanguageDetector.getDefaultLanguageDetector().loadModels());
     OutputStreamWriter outputStreamWriter =
         new OutputStreamWriter(output, StandardCharsets.UTF_8);
     CompositeWriter compositeWriter =
         new CompositeWriter(outputStreamWriter, languageWriter)) {
    WriteOutContentHandler handler = new WriteOutContentHandler(compositeWriter, indexedChars);
    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    tika.getParser().parse(input, new BodyContentHandler(handler), new Metadata(), context);
}
{code}