Re: Indexing Doc, PDF, ... from filesystem (Newbie Question)

Peter Manis Tue, 21 Aug 2007 08:31:25 -0700

Installing the patch requires checking out the latest Solr source via
Subversion and applying the patch to it.  Eric has updated his patch
against several Subversion revisions.  To make sure it compiles, I
suggest checking out the revision he lists.

As for using the features of this patch, this is the URL that would be called:

/solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description

Breaking this down:

stream.file is the absolute path to the file you want to index.
stream.type specifies the type of file; pdf, xls, doc, and ppt are
currently supported.  The next parameter is id, where you specify the
unique value for the id field in your schema.  For example, we had a
document reference in a database with the id 103, so we would pass
id=103 to identify that document in the index.  stream.fieldname is
the name of the field in your index that will actually store the text
from the document.  Our field was called 'data', so the URL contained
stream.fieldname=data.
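
As a rough illustration of the first two parameters, here is a small
Python sketch that turns a file path into the values for stream.file
and stream.type.  The helper name and the idea of deriving the type
from the file extension are my own assumptions, not part of the patch;
the list of types is just the one given above.

```python
import os

# File types this message says the patch currently supports.
SUPPORTED_TYPES = {"pdf", "xls", "doc", "ppt"}

def stream_params_for(path):
    """Return (absolute_path, stream_type) for a supported file, else None.

    Hypothetical helper: the type is guessed from the file extension,
    which is an assumption of this sketch, not something the patch does.
    """
    ext = os.path.splitext(path)[1].lower().lstrip(".")
    if ext not in SUPPORTED_TYPES:
        return None
    # stream.file is described as needing an absolute path.
    return os.path.abspath(path), ext
```

Files with other extensions return None, so the caller can skip them
when walking a folder of mixed document types.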

The fieldnames parameter lists any additional fields in your index
that need to be filled.  We were passing a category, a description of
the document, a name, and the type, so you just specify the names of
those fields.  Solr will then look for corresponding parameters with
those names, which you can see at the end of my URL.  The values passed
for these additional parameters must be URL-encoded.
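
To make the encoding step concrete, here is a minimal Python sketch
that assembles the URL described above.  The function name, the host
and port in the usage example, and the sample field values are
assumptions for illustration only; the parameter names are the ones
from my URL.

```python
from urllib.parse import urlencode

def build_rich_update_url(base, path, filetype, doc_id, storage_field, extra_fields):
    """Build the /solr/update/rich URL; extra_fields maps field name -> value."""
    params = [
        ("stream.file", path),               # absolute path to the file
        ("stream.type", filetype),           # pdf, xls, doc or ppt
        ("id", doc_id),                      # unique id value from your schema
        ("stream.fieldname", storage_field), # field that stores the document text
        ("fieldnames", ",".join(extra_fields)),
    ]
    # One parameter per name listed in fieldnames.
    params.extend(extra_fields.items())
    # urlencode URL-encodes every value; safe="," keeps the commas readable.
    return base + "/solr/update/rich?" + urlencode(params, safe=",")

# Hypothetical host, port and values:
url = build_rich_update_url(
    "http://localhost:8983", "/docs/annual report.pdf", "pdf", 103, "data",
    {"cat": "finance", "name": "Annual Report"},
)
print(url)
```

Building the query string from a list of pairs keeps the parameters in
the same order as in the URL above and leaves the encoding to the
library instead of doing it by hand.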

I'm not a Java programmer, so if you have questions about the internals
of the code, definitely direct those to Eric, as I cannot help.  I have
only used it from web applications.  If you have any other questions
about using the patch, I can answer those.

Enjoy!

 - Pete

On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> There seems to be some code out for Tika now (not packaged/announced yet,
> but...). Could someone please take a look at it and see if that could fit
> in? I am eagerly waiting for a reply back from tika-dev, but no luck yet.
>
> http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
>
> I see that Eric's patch uses POI (for most of it)...so that's great! I have
> seen too many duplicated efforts, even in Apache projects alone, and this is
> one step closer to fixing it (other than Tika, which isn't 'complete' yet).
> Are there any plans on releasing this patch with Solr dist? Or, any
> instructions on using/installing the patch itself?
>
> Thanks
> Vish
>
>
> On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> >
> > Christian,
> >
> > Eric Pugh implemented this functionality for a project we were
> > doing and has released the code on JIRA.  We have had very good results
> > with it.  If I can be of any help using it beyond the Java code itself
> > let me know.  The last revision I used with it was 552853, so if the
> > build happens to fail you can roll back to that and it will work.
> >
> > https://issues.apache.org/jira/browse/SOLR-284
> >
> > - Pete
> >
> > On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > > Hi Solr Users,
> > >
> > > I have set up a Solr server with a custom schema.
> > > Now I have updated the index with some content from
> > > XML files.
> > >
> > > Now I am trying to index the contents of a folder.
> > > The folder consists of various document types
> > > (pdf, doc, xls, ...).
> > >
> > > Is there a howto anywhere on how I can parse the
> > > documents, make an XML of the parsed content,
> > > and post it to the Solr server?
> > >
> > > Thanks in advance.
> > >
> > > Christian
> > >
> > >
> >
>
