Re: Indexing Solr with the web crawler

Erlend Garåsen Tue, 25 Jan 2011 01:59:59 -0800

On 24.01.11 14.48, Karl Wright wrote:

Thanks for the information.
What I'd like to do is wait until your research is done and then post
the rough instructions to d...@lucene.apache.org for confirmation that
your approach is the preferred one.  I'd also like to know if you
check out the latest solr release from the svn tag and just build it,
whether you have any of these problems.  I've been building
solr/lucene trunk and not using the binary distribution, which may be
why I never noticed that this has gone away in the main distribution.

OK, it might take a week or so, but here are some details I just figured out:

- There is a bug with the current Solr release (1.4.1) which makes it impossible to extract the content by using the ExtractingRequestHandler. I think it is related to this Jira issue:

https://issues.apache.org/jira/browse/SOLR-1902?page=com.atlassian.jira.plugin.ext.subversion%3Asubversion-commits-tabpanel

- This issue is now fixed, and if I check out the latest version from trunk, content can now be extracted by Tika.

What I need to test is whether I need to place the tika/extracting jars manually in a lib folder when I deploy solr.war on Resin by using the latest trunk version from SVN. When this is done, I can inform you.

Anyway, I would rather not build a search application for my university on the latest version from trunk; I would prefer to use an official release. So maybe I will try to port the changes from trunk instead. I can already see that trunk has a newer version of Tika than the official 1.4.1 release, i.e. tika-core-0.8.jar instead of tika-core-0.4.jar.


Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
