Oh, there's more, unfortunately. Some of the Tika dependencies need to be updated further. I couldn't parse the date from PDF documents correctly. I'm not quite sure which of the extraction libraries is causing this problem (probably pdfbox). Anyway, I can now extract content from the following document formats without any problems:
- HTML
- RTF
- DOC
- DOCX
- ODT
- XLSX
- XLS
- SXW
- PDF

I'm using the following jars:
apache-solr-cell-1.4.2-dev.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
poi-scratchpad-3.7.jar
asm-3.1.jar
icu4j-4_6.jar
rome-0.9.jar
bcmail-jdk15-1.45.jar
jempbox-1.3.1.jar
tagsoup-1.2.jar
bcprov-jdk15-1.45.jar
metadata-extractor-2.4.0-beta-1.jar
tika-core-0.8.jar
boilerpipe-1.1.0.jar
netcdf-4.2.jar
tika-parsers-0.8.jar
commons-compress-1.1.jar
pdfbox-1.3.1.jar
commons-logging-1.1.1.jar
poi-3.7.jar
xercesImpl-2.8.1.jar
dom4j-1.6.1.jar
poi-ooxml-3.7.jar
xml-apis-1.0.b2.jar
fontbox-1.3.1.jar
poi-ooxml-schemas-3.7.jar
xmlbeans-2.3.0.jar

But I still have some problems with PDF documents[1]. I'm not sure whether it is a pdfbox bug, but Norwegian characters such as æ, ø and å are not displayed correctly after Solr has indexed the document. Each of these characters is replaced by a question mark.

[1] http://ridder.uio.no/dokument.pdf
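
One thing that may be worth ruling out before blaming pdfbox: a literal question mark is also what Java produces when text is encoded into a charset that has no mapping for a character. The sketch below is plain Java with no Tika or Solr involved, and the class and method names are made up for illustration. It only shows the symptom in isolation; it does not show that this is what happens with the document above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetRoundTrip {

    // Encode the text with the given charset and decode it again,
    // mimicking a hand-off between two components that agree on that charset.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    // Count literal question marks, the symptom seen in the indexed text.
    static long countQuestionMarks(String text) {
        return text.chars().filter(c -> c == '?').count();
    }

    public static void main(String[] args) {
        // The three Norwegian letters, written as Unicode escapes so that
        // the encoding of this source file does not matter.
        String sample = "\u00e6\u00f8\u00e5";

        Charset[] candidates = {
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.US_ASCII
        };

        for (Charset cs : candidates) {
            String result = roundTrip(sample, cs);
            System.out.println(cs + ": intact=" + result.equals(sample)
                    + ", question marks=" + countQuestionMarks(result));
        }
    }
}
```

UTF-8 and ISO-8859-1 both keep the three letters intact, while US-ASCII turns each of them into a question mark. If the text extracted from the PDF already contains question marks before it reaches Solr, the extraction library is the more likely culprit; if not, a charset mismatch somewhere later in the chain would be my next guess.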

Erlend

On 30.03.11 18.09, Karl Wright wrote:
Certainly it makes sense to start with the FAQ, especially for places
where you are tripping over known bugs.  We can always do a site page
later.

Thanks!
Karl

On Wed, Mar 30, 2011 at 12:07 PM, Erlend Garåsen wrote:
On 30.03.11 18.00, Karl Wright wrote:

It would be great if this information went at least into the FAQ, and
even better if we added a page to the site documentation.  I'm
thinking maybe a whole page titled "Integrating with Solr", which
would walk you through the process and the pitfalls.  What do you
think?

Yes, I think so.

The next version of Solr will probably be released soon, and then it will be
much easier to integrate with Solr. Maybe it is sufficient to add the information
to the FAQ, since the problem mentioned only affects 1.4.1?

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050



