Re: [TIP] Workaround for Solr bugs when Indexing Solr 1.4.1

Erlend Garåsen Mon, 04 Apr 2011 08:32:53 -0700

Thanks for the information about the latest Solr release. I will start to look at this tomorrow.


It will probably work out of the box.

Erlend


On 04.04.11 17.26, Karl Wright wrote:

Good to know that it can be made to work. ;-)

We should probably look at Lucene/Solr 3.1, which was just released,
and is the next Solr version after 1.4.1, and see whether anything
special is needed there.

Karl


On Mon, Apr 4, 2011 at 11:01 AM, Erlend Garåsen <[email protected]> wrote:


After I downloaded and replaced the following jars, I no longer have a
character encoding problem:
pdfbox-1.5.0.jar
fontbox-1.5.0.jar
jempbox-1.5.0.jar
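
The three jars above are released together by the PDFBox project, so it can help to confirm that a lib directory holds all three at the same version after a swap like this. The following is only a minimal sketch, not something taken from this thread: the function names and the directory path are hypothetical, and the file-name parsing assumes the usual `artifact-version.jar` naming.

```python
import re
from pathlib import Path

# Jars that are released together as part of the PDFBox project.
PDFBOX_FAMILY = ("pdfbox", "fontbox", "jempbox")

# Split "pdfbox-1.5.0.jar" into ("pdfbox", "1.5.0"): the version is taken
# to start at the first dash that is followed by a digit.
JAR_NAME = re.compile(r"^(?P<artifact>.+?)-(?P<version>\d.*)\.jar$")


def parse_jar_name(file_name):
    """Return (artifact, version), or None if the name has no version."""
    match = JAR_NAME.match(file_name)
    if match is None:
        return None
    return match.group("artifact"), match.group("version")


def family_versions(file_names, family=PDFBOX_FAMILY):
    """Map each family artifact found in file_names to its version."""
    versions = {}
    for file_name in file_names:
        parsed = parse_jar_name(file_name)
        if parsed is not None and parsed[0] in family:
            versions[parsed[0]] = parsed[1]
    return versions


def is_consistent(versions, family=PDFBOX_FAMILY):
    """True when every family jar is present and all share one version."""
    return set(versions) == set(family) and len(set(versions.values())) == 1


if __name__ == "__main__":
    # Hypothetical path; point this at the directory holding the Solr Cell jars.
    lib_dir = Path("contrib/extraction/lib")
    names = [p.name for p in lib_dir.glob("*.jar")] if lib_dir.is_dir() else []
    found = family_versions(names)
    status = "consistent" if is_consistent(found) else "inconsistent or incomplete"
    print(found, status)
```

A mixed set such as pdfbox-1.5.0.jar alongside fontbox-1.3.1.jar would be reported as inconsistent.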

Erlend

On 31.03.11 14.35, Karl Wright wrote:


It might be worth cross-posting this to the Tika user or dev list.
Jukka Zitting is one of the principal Tika developers and he's also a
committer for MCF, but I'm not sure he'll notice it go by otherwise.

In case you're wondering how to update the MCF FAQ, it's in the Wiki
so all you need to do is sign up and you'll be able to update it.
https://cwiki.apache.org/confluence/display/CONNECTORS/FAQ

Karl

On Thu, Mar 31, 2011 at 6:59 AM, Erlend Garåsen <[email protected]> wrote:


Oh, there's more, unfortunately. Some of the Tika dependencies need to be
updated further. I couldn't parse the date from PDF documents correctly.
I'm not quite sure which of the extraction libraries is causing this
problem (probably pdfbox). Anyway, I can now extract content from the
following document formats without any problems:
- HTML
- RTF
- DOC
- DOCX
- ODT
- XLSX
- XLS
- SXW
- PDF

I'm using the following jars:
apache-solr-cell-1.4.2-dev.jar
geronimo-stax-api_1.0_spec-1.0.1.jar
poi-scratchpad-3.7.jar
asm-3.1.jar
icu4j-4_6.jar
rome-0.9.jar
bcmail-jdk15-1.45.jar
jempbox-1.3.1.jar
tagsoup-1.2.jar
bcprov-jdk15-1.45.jar
metadata-extractor-2.4.0-beta-1.jar
tika-core-0.8.jar
boilerpipe-1.1.0.jar
netcdf-4.2.jar
tika-parsers-0.8.jar
commons-compress-1.1.jar
pdfbox-1.3.1.jar
commons-logging-1.1.1.jar
poi-3.7.jar
xercesImpl-2.8.1.jar
dom4j-1.6.1.jar
poi-ooxml-3.7.jar
xml-apis-1.0.b2.jar
fontbox-1.3.1.jar
poi-ooxml-schemas-3.7.jar
xmlbeans-2.3.0.jar

But I still have some problems with PDF documents[1]. I'm not sure
whether it is a pdfbox bug, but Norwegian characters like æ, ø and å are
not displayed correctly after Solr has indexed the document. Each of
these characters is replaced by a question mark.
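
For readers unfamiliar with this symptom: the thread does not establish the mechanism behind it, so the following is only a generic illustration, not a diagnosis of the pdfbox behaviour. It sketches one common way question marks appear, namely text being pushed through a character set that cannot represent æ, ø and å.

```python
text = "æ, ø and å"

# Encoding to a charset that has no æ/ø/å forces a substitution; with
# errors="replace" Python substitutes "?" for each unencodable character.
lossy = text.encode("ascii", errors="replace").decode("ascii")

# UTF-8 can represent every character, so a round trip is lossless.
lossless = text.encode("utf-8").decode("utf-8")

print(lossy)
print(lossless)
```

Once the substitution has happened the original characters cannot be recovered from the indexed text, which is why the fix has to be made on the extraction side.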

[1] http://ridder.uio.no/dokument.pdf

Erlend

On 30.03.11 18.09, Karl Wright wrote:


Certainly it makes sense to start with the FAQ, especially for places
where you are tripping over known bugs.  We can always do a site page
later.

Thanks!
Karl

On Wed, Mar 30, 2011 at 12:07 PM, Erlend Garåsen <[email protected]> wrote:


On 30.03.11 18.00, Karl Wright wrote:


It would be great if this information went at least into the FAQ, and
even better if we added a page to the site documentation.  I'm
thinking maybe a whole page titled "Integrating with Solr", which
would walk you through the process and the pitfalls.  What do you
think?


Yes, I think so.

The next version of Solr will probably be released soon, and then it
will be much easier to integrate Solr. Maybe it is sufficient to add the
information into the FAQ since the problem mentioned only affects 1.4.1?

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050


