Hello!

On 02.12.2010, at 19:11, randeel wimalagunarathne wrote:
> Hi Max,
>
> yes, that's what I am trying to do. Can you help me with that?
> How did you find that there are 2 xlsx files and one xls file?
> Thank you for providing me the help.
>

Word stores embedded objects in the "ObjectPool" directory entry, and the names of its child entries start with the "_" symbol. If such a directory contains a "Package" entry, then that entry holds an OOXML-based document as a raw (ZIP) stream (you can use DocumentInputStream to read that binary stream). Otherwise it is some OLE-based format or some binary embedded in an Ole10Native stream.

I recommend you look at these two source files from the Apache Tika project:

1) the parse function in WordExtractor:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
2) handleEmbeddedOfficeDoc in AbstractPOIFSExtractor:
https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/AbstractPOIFSExtractor.java

Best wishes,
Max
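P.S. The decision rule described above can be sketched in plain Java, without any POI dependency. This is only an illustration, not the Tika code: the class EmbeddedObjectClassifier, its Kind enum and its method names are hypothetical, and it assumes you have already collected the child entry names of one "_..." directory under ObjectPool by whatever means you use to walk the file. It also assumes that the Ole10Native stream name may be stored with a leading 0x01 character, so both spellings are checked.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper: classifies one embedded object from the names of
// the entries found inside its "_..." directory under ObjectPool.
final class EmbeddedObjectClassifier {

    enum Kind { OOXML_PACKAGE, OLE10_NATIVE, OTHER_OLE }

    // Assumption: the stream name may carry a leading 0x01 control character.
    private static final String OLE10_NATIVE = "Ole10Native";
    private static final String OLE10_NATIVE_PREFIXED = "\u0001Ole10Native";

    // ObjectPool children that hold embedded objects have names starting with "_".
    static boolean isObjectPoolEntry(String name) {
        return name != null && name.startsWith("_");
    }

    // "Package" means a raw OOXML (ZIP) stream; otherwise look for
    // Ole10Native; anything else is treated as some other OLE-based format.
    static Kind classify(Set<String> childNames) {
        if (childNames.contains("Package")) {
            return Kind.OOXML_PACKAGE;
        }
        if (childNames.contains(OLE10_NATIVE)
                || childNames.contains(OLE10_NATIVE_PREFIXED)) {
            return Kind.OLE10_NATIVE;
        }
        return Kind.OTHER_OLE;
    }

    // Sanity check for bytes read from a "Package" stream: a ZIP file
    // starts with the local file header signature 'P' 'K' 0x03 0x04.
    static boolean looksLikeZip(byte[] head) {
        return head != null && head.length >= 4
                && head[0] == 'P' && head[1] == 'K'
                && head[2] == 3 && head[3] == 4;
    }

    public static void main(String[] args) {
        Set<String> names = new HashSet<>(Arrays.asList("Package", "\u0001CompObj"));
        System.out.println(classify(names));
        System.out.println(looksLikeZip(new byte[] {'P', 'K', 3, 4}));
    }
}
```

In a real extractor the first bytes passed to looksLikeZip would come from the "Package" stream (for example, read through DocumentInputStream), and the stream could then be handed to an OOXML parser.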
