[
https://issues.apache.org/jira/browse/TIKA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sébastien Nussbaumer updated TIKA-2874:
---------------------------------------
Affects Version/s: 1.19.1
> Parsing of 4 mb excel file generates 163 mb worth of words
> ----------------------------------------------------------
>
> Key: TIKA-2874
> URL: https://issues.apache.org/jira/browse/TIKA-2874
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.20, 1.19.1
> Reporter: Sébastien Nussbaumer
> Priority: Major
> Attachments: excel_that_generates_huge_number_of_words.xlsx,
> tika-config.xml
>
>
> When I parse the attached 4 MB Excel file, I get 163 MB worth of words. When
> I inspect the output, I see that some cells are repeated *many hundreds of
> thousands* of times.
> I tried passing the words through the uniq Linux command-line utility and got
> a much more reasonable 16 KB file.
> This is the code I use:
> {code:java}
> TikaConfig config = new TikaConfig(
>         new ClassPathResource("tika-config.xml").getURL());
> Detector detector = config.getDetector();
> Parser autoDetectParser = new AutoDetectParser(config);
> Tika tika = new Tika(detector, autoDetectParser);
> try (LanguageWriter languageWriter = new LanguageWriter(
>              LanguageDetector.getDefaultLanguageDetector().loadModels());
>      OutputStreamWriter outputStreamWriter =
>              new OutputStreamWriter(output, StandardCharsets.UTF_8);
>      CompositeWriter compositeWriter =
>              new CompositeWriter(outputStreamWriter, languageWriter)) {
>     WriteOutContentHandler handler =
>             new WriteOutContentHandler(compositeWriter, indexedChars);
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     tika.getParser().parse(input, new BodyContentHandler(handler),
>             new Metadata(), context);
> }
> {code}
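
The uniq check described in the report can be approximated in plain Java, which may help when comparing the raw and collapsed output sizes without a shell. This is only a rough sketch under stated assumptions: the class name `AdjacentDuplicateFilter` and its `collapse` method are hypothetical and not part of Tika, and like uniq it only drops a line that is identical to the line immediately before it. It is a diagnostic aid, not a fix for the parser.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper (not part of Tika): mimics the default behaviour of
// the uniq utility by collapsing runs of identical adjacent lines.
public class AdjacentDuplicateFilter {

    /**
     * Copies lines from {@code in} to {@code out}, skipping every line that
     * equals the line immediately before it.
     *
     * @return the number of lines that were skipped
     */
    public static long collapse(Reader in, Writer out) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        String previous = null;
        long skipped = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.equals(previous)) {
                skipped++;          // same as the preceding line: drop it
                continue;
            }
            out.write(line);
            out.write('\n');
            previous = line;
        }
        out.flush();
        return skipped;
    }

    public static void main(String[] args) throws IOException {
        // Tiny in-memory sample; a real check would read the extracted text.
        StringWriter out = new StringWriter();
        long skipped = collapse(new StringReader("a\na\na\nb\na\n"), out);
        System.out.println(skipped + " adjacent duplicate lines skipped");
        System.out.print(out);
    }
}
```

A large skipped count relative to the number of lines kept would point to the same run-of-repeated-cells pattern described above. Because only adjacent repeats are removed, a value that reappears after a different line is kept.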