Hi,
I've got a client that is interested in indexing a large number of Word
.doc files. The organization was looking at Verity/Retrievalware, but
they simply can't afford these products. I suggested that they look at a
POI/Lucene solution.
They are sold on the open-source idea and would consider funding some of
my time to extend HDF, but I have to prove that HDF is a viable base to
start from. Disclaimer: I'm not sure about the number of hours they are
willing to sponsor, so you should not get too excited just yet ;-)
I hope it will be enough to move HDF out of scratchpad status.
I played with HDF over the weekend and the CVS version of the scratchpad
didn't seem to work at all. I used the event model for HDF. Here are
patches that I created to make it work:
diff -r1.1 EventBridge.java
290c290
< sb.append(_mainDocument[y]);
---
> sb.append((char)_mainDocument[y]);
diff -r1.10 HDFObjectFactory.java
147a148,149
>
> _listener.mainDocument(_mainDocument);
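To illustrate why the cast in the EventBridge patch matters, here is a minimal stand-alone sketch. It assumes _mainDocument is a byte[] (the diff does not show its declaration), and the class and method names are made up for the example. StringBuffer has no append(byte) overload, so an uncast byte is widened to int and its decimal value is appended instead of a character:

```java
public class CharCastDemo {
    // Without a cast, each byte is widened to int and append(int) is chosen,
    // so the decimal value of the byte ends up in the buffer.
    static String appendRaw(byte[] data) {
        StringBuffer sb = new StringBuffer();
        for (int y = 0; y < data.length; y++) {
            sb.append(data[y]);
        }
        return sb.toString();
    }

    // With the (char) cast, append(char) is chosen and a character is appended.
    static String appendAsChar(byte[] data) {
        StringBuffer sb = new StringBuffer();
        for (int y = 0; y < data.length; y++) {
            sb.append((char) data[y]);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] data = { 'H', 'i' };
        System.out.println(appendRaw(data));    // prints 72105
        System.out.println(appendAsChar(data)); // prints Hi
    }
}
```

Note that a plain (char) cast sign-extends bytes above 0x7F, so this is only safe for 7-bit text; anything else would need masking with 0xFF or a proper charset decode.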
These patches worked for simple documents, but bulleted lists and tables
still broke. To fix the tables I added the following hack. With this
code in place I can get the text from the table, but it doesn't seem
like the proper solution. Maybe someone can help me here?
diff -r1.4 StyleSheet.java
1167a1168,1174
> if (brcTop == null) break;
> if (brcLeft == null) break;
> if (brcBottom == null) break;
> if (brcRight == null) break;
> if (brcVertical == null) break;
> if (brcHorizontal == null) break;
With the bulleted lists I'm stumped. I get the following exception, and
I can't even hack it to work. Can anybody at least point me in the right
direction to find this problem?
Exception in thread "main" java.lang.NegativeArraySizeException
    at org.apache.poi.hdf.model.HDFObjectFactory.createListTables(HDFObjectFactory.java:644)
    at org.apache.poi.hdf.model.HDFObjectFactory.initFormattingProperties(HDFObjectFactory.java:277)
    at org.apache.poi.hdf.model.HDFObjectFactory.<init>(HDFObjectFactory.java:155)
    at org.apache.poi.hdf.model.HDFDocument.<init>(HDFDocument.java:18)
    at org.apache.poi.hdf.test.DataExtractor.extract(DataExtractor.java:25)
    at org.apache.poi.hdf.test.DataExtractor.main(DataExtractor.java:98)
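I have not found the real cause in createListTables, so the following is only a generic sketch of how this kind of exception can arise, not a diagnosis. It assumes a 16-bit length field is read from the file as a signed value and then used as an array size; the class and method names are hypothetical and are not part of HDF:

```java
public class NegativeSizeDemo {
    // Reads a little-endian 16-bit value and interprets it as signed.
    static int readSigned16(byte[] buf, int offset) {
        return (short) ((buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8));
    }

    // Reads the same two bytes as an unsigned value (0..65535).
    static int readUnsigned16(byte[] buf, int offset) {
        return (buf[offset] & 0xFF) | ((buf[offset + 1] & 0xFF) << 8);
    }

    public static void main(String[] args) {
        byte[] buf = { (byte) 0xFF, (byte) 0xFF };
        int size = readSigned16(buf, 0);
        try {
            // Allocating with a negative length throws NegativeArraySizeException.
            byte[] table = new byte[size];
            System.out.println("allocated " + table.length);
        } catch (NegativeArraySizeException e) {
            System.out.println("negative size: " + size);
        }
        System.out.println("unsigned read: " + readUnsigned16(buf, 0));
    }
}
```

A size read from the wrong file offset would produce the same symptom, so printing the value just before line 644 is probably the first thing to check.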
I have also created a [simple] suite of test .doc files and classes to
test various features. These classes can easily be incorporated into a
JUnit test (they are stand-alone at the moment). The test suite was
done with MS Word 2000 on a Windows 2000 machine. If you are interested
in these I can donate them to the POI project.
I would *really* like to strengthen my client's open-source drive, and
this project is aimed at TOP management. If I can show that POI/HDF is
viable by the end of this week I can probably make the cut-off point. I
would appreciate any help, even if it is just some pointers on where to
start looking for the above-mentioned problems. Some code/bug-fixes
would be cool too ;-)
~ Leon
