[jira] [Comment Edited] (TIKA-2874) Parsing of 4 mb excel file generates 164 mb worth of words

Tim Allison (JIRA) Wed, 15 May 2019 05:45:37 -0700


    [ https://issues.apache.org/jira/browse/TIKA-2874?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16840365#comment-16840365 ]


Tim Allison edited comment on TIKA-2874 at 5/15/19 12:44 PM:
-------------------------------------------------------------

Well, that's exciting! :P  

When you decompress the zip file, the first sheet is 100MB.

When I look at the shared strings file, I see: {{count="331360" 
uniqueCount="143"}} so, yeah, there's a lot of duplicated data, but this isn't 
a Tika problem...I don't think.  The issue is that you can't actually see this 
easily in Excel.

In short, between the decompression and the "pointers" used to reference the 
shared strings file, I'm not surprised that you're getting 150MB.
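
If you want to check this on your own files, here is a minimal sketch that uses only java.util.zip (no Tika involved). It lists the uncompressed size of every entry in the .xlsx package and pulls the count/uniqueCount attributes off the shared strings root element. The class and method names (XlsxSizeProbe, uncompressedSizes, entryHead, sstAttribute) are made up for illustration, and it assumes the shared strings part sits at the conventional xl/sharedStrings.xml path and that the <sst> root tag appears within the first 4 KB of that part.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika or POI.
public class XlsxSizeProbe {

    /** Inflates every zip entry and returns its uncompressed size in bytes. */
    public static Map<String, Long> uncompressedSizes(InputStream in) throws IOException {
        Map<String, Long> sizes = new LinkedHashMap<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            byte[] buf = new byte[8192];
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                long total = 0;
                int n;
                while ((n = zip.read(buf)) != -1) {
                    total += n;
                }
                sizes.put(entry.getName(), total);
            }
        }
        return sizes;
    }

    /** Returns up to maxBytes of the named entry as UTF-8 text, or null if absent. */
    public static String entryHead(InputStream in, String entryName, int maxBytes)
            throws IOException {
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals(entryName)) {
                    continue;
                }
                byte[] buf = new byte[maxBytes];
                int filled = 0;
                int n;
                while (filled < maxBytes
                        && (n = zip.read(buf, filled, maxBytes - filled)) != -1) {
                    filled += n;
                }
                return new String(buf, 0, filled, StandardCharsets.UTF_8);
            }
        }
        return null;
    }

    /** Extracts a numeric attribute of the <sst> root tag, or -1 if not found. */
    public static long sstAttribute(String xmlHead, String name) {
        Pattern p = Pattern.compile("<sst\\b[^>]*?\\s" + Pattern.quote(name) + "=\"(\\d+)\"");
        Matcher m = p.matcher(xmlHead);
        return m.find() ? Long.parseLong(m.group(1)) : -1L;
    }

    public static void main(String[] args) throws IOException {
        Path xlsx = Paths.get(args[0]);
        try (InputStream in = Files.newInputStream(xlsx)) {
            uncompressedSizes(in).forEach((name, size) ->
                    System.out.println(size + "\t" + name));
        }
        try (InputStream in = Files.newInputStream(xlsx)) {
            // Assumption: conventional location of the shared strings part.
            String head = entryHead(in, "xl/sharedStrings.xml", 4096);
            if (head != null) {
                System.out.println("count=" + sstAttribute(head, "count")
                        + " uniqueCount=" + sstAttribute(head, "uniqueCount"));
            }
        }
    }
}
```

Run it with the path to the .xlsx as the only argument. A large gap between count and uniqueCount, together with a sheet part far bigger than the file on disk, points to repeated cell values in the workbook itself rather than to duplication introduced during parsing.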


was (Author: [email protected]):
Well, that's exciting! :P  

When you decompress the zip file, the first sheet is 100MB.

When I look at the shared strings file, I see: {{count="331360" 
uniqueCount="143"}} so, yeah, there's a lot of duplicated data, but this isn't 
a Tika problem...I don't think.  The issue is that you can't actually see this 
easily in Excel.

> Parsing of 4 mb excel file generates 164 mb worth of words
> ----------------------------------------------------------
>
>                 Key: TIKA-2874
>                 URL: https://issues.apache.org/jira/browse/TIKA-2874
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.20, 1.19.1
>            Reporter: Sébastien Nussbaumer
>            Priority: Major
>         Attachments: excel_that_generates_huge_number_of_words.xlsx, 
> tika-config.xml
>
>
> When I parse the attached 4 MB Excel file, I get 164 MB worth of words. When 
> checking the words, I see that some cells are repeated *many hundreds of 
> thousands* of times.
> I tried passing the words through the uniq Linux command-line utility and got 
> a file with a much more reasonable size of 16 KB.
> This is the code I use:
> {code:java}
> TikaConfig config = new TikaConfig(new ClassPathResource("tika-config.xml").getURL());
> Detector detector = config.getDetector();
> Parser autoDetectParser = new AutoDetectParser(config);
> Tika tika = new Tika(detector, autoDetectParser);
> try (LanguageWriter languageWriter = new LanguageWriter(
>              LanguageDetector.getDefaultLanguageDetector().loadModels());
>      OutputStreamWriter outputStreamWriter = new OutputStreamWriter(output, StandardCharsets.UTF_8);
>      CompositeWriter compositeWriter = new CompositeWriter(outputStreamWriter, languageWriter)) {
>     WriteOutContentHandler handler = new WriteOutContentHandler(compositeWriter, indexedChars);
>     ParseContext context = new ParseContext();
>     context.set(Parser.class, tika.getParser());
>     tika.getParser().parse(input, new BodyContentHandler(handler), new Metadata(), context);
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
