Sébastien Nussbaumer created TIKA-2874:
------------------------------------------

             Summary: Parsing of 4 mb excel file generates 163 mb worth of words
                 Key: TIKA-2874
                 URL: https://issues.apache.org/jira/browse/TIKA-2874
             Project: Tika
          Issue Type: Bug
          Components: parser
    Affects Versions: 1.20
            Reporter: Sébastien Nussbaumer
         Attachments: tika-config.xml

When I parse the attached 4 MB Excel file, I get 163 MB worth of words. When 
I inspect the output, I see that some cells are repeated *many hundreds of 
thousands* of times.

I passed the words through the uniq Linux command-line utility and got a much 
more reasonable 16 KB file.
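
For anyone who wants to reproduce the uniq check from Java, here is a minimal sketch using only the JDK. The class name RepeatStats and its methods are hypothetical helpers and not part of Tika. Plain uniq only collapses adjacent duplicate lines, so the sketch also includes a per-line counter to show which lines are repeated most.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Tika: summarizes repetition in extracted text.
class RepeatStats {

    // Collapses runs of identical adjacent lines, like `uniq` with no options.
    static List<String> collapseAdjacent(List<String> lines) {
        List<String> out = new ArrayList<>();
        String previous = null;
        for (String line : lines) {
            if (!line.equals(previous)) {
                out.add(line);
            }
            previous = line;
        }
        return out;
    }

    // Counts how often each distinct line occurs, in first-seen order.
    static Map<String, Integer> countOccurrences(List<String> lines) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String line : lines) {
            counts.merge(line, 1, Integer::sum);
        }
        return counts;
    }
}
```

The extracted text could be loaded with Files.readAllLines and passed to either method. Sorting the result of countOccurrences by value should then point at the cells that are emitted most often.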

This is the code I use:

{code:java}
TikaConfig config = new TikaConfig(new ClassPathResource("tika-config.xml").getURL());
Detector detector = config.getDetector();
Parser autoDetectParser = new AutoDetectParser(config);
Tika tika = new Tika(detector, autoDetectParser);
try (LanguageWriter languageWriter =
             new LanguageWriter(LanguageDetector.getDefaultLanguageDetector().loadModels());
     OutputStreamWriter outputStreamWriter =
             new OutputStreamWriter(output, StandardCharsets.UTF_8);
     CompositeWriter compositeWriter =
             new CompositeWriter(outputStreamWriter, languageWriter)) {

    WriteOutContentHandler handler =
            new WriteOutContentHandler(compositeWriter, indexedChars);
    ParseContext context = new ParseContext();
    context.set(Parser.class, tika.getParser());
    tika.getParser().parse(input, new BodyContentHandler(handler), new Metadata(), context);
}
{code}




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
