Md created TIKA-2901:
------------------------

             Summary: Tika extracting points from Chart 
                 Key: TIKA-2901
                 URL: https://issues.apache.org/jira/browse/TIKA-2901
             Project: Tika
          Issue Type: Bug
          Components: app
    Affects Versions: 1.21
            Reporter: Md


I am using Tika to extract content from *.docx and other files. I have noticed 
that Tika extracts the data points from charts and puts them at the end of the 
output. I am using the following code for extraction:
{code:java}
StringBuilder fileContent = new StringBuilder();
Parser parser = new AutoDetectParser();
ContentHandlerFactory factory = new BasicContentHandlerFactory(
        BasicContentHandlerFactory.HANDLER_TYPE.HTML, -1);
//InputStream inputStream = new BufferedInputStream(new FileInputStream(inputFileName));
RecursiveParserWrapper wrapper = new RecursiveParserWrapper(parser, factory);
Metadata metadata = new Metadata();

ParseContext parseContext = new ParseContext();
OfficeParserConfig officeParserConfig = new OfficeParserConfig();
officeParserConfig.setUseSAXDocxExtractor(true);
officeParserConfig.setIncludeDeletedContent(false);
officeParserConfig.setIncludeMoveFromContent(false);
officeParserConfig.setIncludeHeadersAndFooters(false);
parseContext.set(OfficeParserConfig.class, officeParserConfig);

wrapper.parse(inputStream, new DefaultHandler(), metadata, parseContext);
String contents = metadata.get(RecursiveParserWrapper.TIKA_CONTENT);
{code}

Please find attached the input file and the output produced by Tika. 
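
For anyone reproducing this, it may help to first confirm that the input 
actually contains chart parts, since those are the likely source of the extra 
points. The following is only a rough sketch using the JDK's zip classes; it 
does not use Tika at all. The class name DocxChartPartLister and its method are 
hypothetical, and it assumes the charts follow Word's usual naming convention 
of word/charts/chartN.xml inside the .docx package.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Tika: lists chart parts in a .docx package.
class DocxChartPartLister {

    // Assumption: chart parts are named word/charts/chart*.xml, which is the
    // convention Word normally uses. Other producers may name them differently.
    static List<String> listChartParts(InputStream docx) throws IOException {
        List<String> chartParts = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.startsWith("word/charts/chart") && name.endsWith(".xml")) {
                    chartParts.add(name);
                }
            }
        }
        return chartParts;
    }

    // Usage: java DocxChartPartLister input.docx
    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            for (String part : listChartParts(in)) {
                System.out.println(part);
            }
        }
    }
}
```

If this prints one or more entries, the document carries chart XML whose 
cached series values could be what ends up at the end of the extracted text.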



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
