Re: Problem indexing Word Documents

Grant Ingersoll Mon, 26 Nov 2007 10:38:53 -0800

I would ask on the POI mailing list. This doesn't look to be a problem with Lucene.
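
One rough way to check this is to run the extraction step on its own, with no Lucene classes involved. The sketch below is only an illustration under stated assumptions: TextExtractor and findFailures are hypothetical names, and the lambda in main is a fake standing in for the real POI call (new WordExtractor(in).getText() in the quoted code). If a file still fails when only the extractor runs, the failure is in extraction rather than in indexing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ExtractionCheck {

    /** Hypothetical stand-in for the POI extraction call; not a POI or Lucene type. */
    public interface TextExtractor {
        String extract(String path) throws Exception;
    }

    /** Runs the extractor on each path and returns the paths for which it threw. */
    public static List<String> findFailures(List<String> paths, TextExtractor extractor) {
        List<String> failed = new ArrayList<>();
        for (String path : paths) {
            try {
                extractor.extract(path);
            } catch (Exception e) {
                // Report and continue so one bad file does not stop the whole run.
                System.err.println("Extraction failed for " + path + ": " + e);
                failed.add(path);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Fake extractor for illustration only: it simulates a failure on one file.
        // In a real check, this lambda would open the file and call POI instead.
        TextExtractor fake = path -> {
            if (path.endsWith("broken.doc")) {
                throw new IllegalStateException("simulated property set error");
            }
            return "some text";
        };
        System.out.println(findFailures(Arrays.asList("ok.doc", "broken.doc"), fake));
    }
}
```

The same try/catch shape could also wrap the extraction inside an indexing loop, so that a document POI cannot read is logged and skipped instead of ending the program.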


-Grant


On Nov 26, 2007, at 1:17 PM, chris.b wrote:

Okay, so I'm very new to Lucene, so it may be my bad, but I can get it to index .txt files, and when trying to index Word documents (using POI), the program starts running and when it reaches a .doc file, I get the following errors:

Exception in thread "main" org.apache.poi.hpsf.IllegalPropertySetDataException: The property set claims to have a size of 16 bytes. However, it exceeds 16 bytes.
        at org.apache.poi.hpsf.Section.<init>(Section.java:255)
        at org.apache.poi.hpsf.PropertySet.init(PropertySet.java:454)
        at org.apache.poi.hpsf.PropertySet.<init>(PropertySet.java:249)
        at org.apache.poi.hpsf.PropertySetFactory.create(PropertySetFactory.java:61)
        at org.apache.poi.POIDocument.getPropertySet(POIDocument.java:92)
        at org.apache.poi.POIDocument.readProperties(POIDocument.java:69)
        at org.apache.poi.hwpf.HWPFDocument.<init>(HWPFDocument.java:147)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:56)
        at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:48)
        at Indexer.indexFile(Indexer.java:76)
        at Indexer.indexDirectory(Indexer.java:57)
        at Indexer.index(Indexer.java:38)
        at Indexer.main(Indexer.java:20)

and my code is as follows:

        private static void indexFile(IndexWriter writer, File f) throws IOException {
                // Skip files that are hidden, missing, or unreadable.
                if (f.isHidden() || !f.exists() || !f.canRead()) {
                        return;
                }

                // Message is Portuguese for "Adding <path> to the index."
                System.out.println("A acrescentar " + f.getCanonicalPath() + " ao indice.");

                Document doc = new Document();

                // For .doc files: extract the text with POI's WordExtractor
                if (f.getName().endsWith(".doc")) {
                        FileInputStream docfin = new FileInputStream(f.getAbsolutePath());
                        try {
                                WordExtractor docextractor = new WordExtractor(docfin);
                                String content = docextractor.getText();
                                doc.add(new Field("contents", content, Field.Store.NO, Field.Index.TOKENIZED));
                        } finally {
                                // Release the file handle even if extraction throws.
                                docfin.close();
                        }
                } // For .txt files
                else if (f.getName().endsWith(".txt")) {
                        doc.add(new Field("contents", new FileReader(f)));
                }

                doc.add(new Field("filename", f.getCanonicalPath(), Field.Store.YES, Field.Index.TOKENIZED));
                writer.addDocument(doc);
        }

(I think I included all that's necessary.)
Thanks in advance for any help.
--
View this message in context: 
http://www.nabble.com/Problem-indexing-Word-Documents-tf4876643.html#a13954702
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]


--------------------------
Grant Ingersoll
http://lucene.grantingersoll.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

