On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
>
> I am a little confused about how you have things set up: these meta data
> files contain certain information, and there may or may not be a pdf,
> xls, or doc that they are associated with?


Yes, you have it right.

> If that is the case, if it were me I would write something to parse
> the meta data files, and if there is a binary file associated with it,
> submit it using the url I showed you.  If the meta data is just that
> and has no associated documents, submit it in XML form.  The script
> shouldn't be too complicated, but that would depend on the complexity
> of the meta data you are parsing.
>
> To give you an idea how I use it: we have hundreds of documents in
> PDF, DOC, XLS, HTML, TXT, CSV, and PPT formats.  When a document is to
> be indexed by solr we look at the extension.  If it is txt, html, or
> htm we read the data in and submit it with the xml handler.  If
> the document is one of the binary formats we submit it with the url I
> showed you.  All information about these files is stored in a database,
> and some of the 'documents' in the database are just links to external
> documents.  In that case we are only indexing a description, title,
> and category.
>
> You are correct, it would overwrite the data by doing an update unless
> you parsed the meta data, and if you are parsing the meta data you
> might as well just parse it from the start and index once.
>
> How are you handling these meta data files right now?  Are they simply
> xml files like in the solr example that you are just running the bash
> script on, or is something parsing the contents already?


Yes, I am running a similar bash script to index these meta-data xml docs.
The big downside of the url approach is that, for one thing, it has a
character limit (1024, is it?). So, if I had a lot of meta-data, or even a
long description for a record, it might not work all that well. I am
guessing you haven't run into this issue yet, right?

> - Pete
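
Incidentally, the kind of dispatch script you describe might look roughly
like this (just a sketch -- the meta-data layout with <file> and <id>
elements, the paths, and the solr url are all invented):

#!/bin/bash
# walk the meta-data files and dispatch each one to solr
SOLR=http://localhost:8983/solr

for meta in /data/meta/*.xml; do
  # assume the associated binary, if any, is named in a <file> element
  file=$(sed -n 's:.*<file>\([^<]*\)</file>.*:\1:p' "$meta")
  id=$(sed -n 's:.*<id>\([^<]*\)</id>.*:\1:p' "$meta")
  if [ -n "$file" ]; then
    # binary document: hand it to the rich handler, typed by extension
    type="${file##*.}"
    curl "$SOLR/update/rich?stream.file=$file&stream.type=$type&id=$id&stream.fieldname=data"
  else
    # plain meta-data: post the xml doc itself
    curl "$SOLR/update" --data-binary "@$meta" -H 'Content-Type: text/xml'
  fi
done

In practice the parameter values would still need url-encoding, and you
would post a <commit/> at the end.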


The proposed schema additions might not make sense for everyone, since the
actual requirements might be more complex than just that (say you want to
extract text, structure it into various elements, update your doc xml, and
then index). Still, it goes well with the perception of Solr as a
search-engine-in-a-box, now with a 'full-text-' prefix to it. Another way I
can see it happening is to extend the default handler so that it still
takes in an xml doc, but looks out for, say, a field named '<file>'. From
there, within the handler, you can validate the filename and handle it any
way you want (create extra elements, create '<pdf>' for pdf files and
'<html>' for html files, etc.). That removes the need for if/else scripting
outside of Solr.
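
An incoming doc might then look something like this (hypothetical example;
the handler would spot the '<file>' field and pull the text out of the file
itself):

<add>
  <doc>
    <field name="id">103</field>
    <field name="title">Some report</field>
    <field name="file">/docs/report.pdf</field>
  </doc>
</add>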

Rao



> On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> > Pete,
> >
> > Thanks for the great explanation.
> >
> > Thinking it through for my process, I am not sure how to use it:
> >
> > I have a bunch of docs that pretty much contain a lot of meta-data,
> > some of which include full-text files (.pdf, .ppt, etc...). I currently
> > use these docs to index/update into Solr. The next step is to somehow
> > index the text from the full-text files. One way to think about it is,
> > I could have a placeholder field 'data' and keep it empty for the first
> > pass, and then run update/rich to index the actual full-text, but using
> > the same unique doc id. But this would actually overwrite the doc in
> > the index, wouldn't it? And there really isn't a 'merge' operation,
> > right?
> >
> > There might be a better way to use this full-text indexing option,
> > schema-wise, say:
> > <richData source="FIELDNAME" dest="FIELDNAME" />
> > - have a new option richData that will take in a source field name,
> > - validate its value (valid filename/file),
> > - recognize the file type,
> > - and put the 'data' into another field
> >
> > What do you think?  I am not a true Java developer, so I am not sure I
> > could do it myself, but can only hope that someone else on the project
> > could ;-)...
> >
> > Rao
> >
> > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > >
> > > Installing the patch requires downloading the latest solr via
> > > subversion and applying the patch to the source.  Eric has updated
> > > his patch against various subversion revisions, so to make sure it
> > > will compile I suggest getting the revision he lists.
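> > >
> > > Roughly, that is (from memory -- the trunk location and patch file
> > > name may differ; 552853 is just the last revision that worked for me):
> > >
> > > svn co -r 552853 http://svn.apache.org/repos/asf/lucene/solr/trunk solr-trunk
> > > cd solr-trunk
> > > patch -p0 < SOLR-284.patch
> > > ant dist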
> > >
> > > As for using the features of this patch, this is the url that would
> > > be called:
> > >
> > > /solr/update/rich?stream.file=filename&stream.type=filetype&id=id&stream.fieldname=storagefield&fieldnames=cat,desc,type,name&type=filetype&cat=category&name=name&desc=description
> > >
> > > Breaking this down
> > >
> > > You have stream.file, which is the absolute path to the file you
> > > want to index.  Then there is stream.type, which specifies the type
> > > of file and currently supports pdf, xls, doc, and ppt.  The next
> > > field is the id, which is where you specify the unique value for the
> > > id in your schema.  For example, we had a document referenced in a
> > > database, and its id was 103, so we would specify the value 103 to
> > > identify which document it was in the index.  Stream.fieldname is the
> > > name of the field in your index that will actually store the text
> > > from the document.  We had the field 'data', so it would be
> > > stream.fieldname=data in the url.
> > >
> > > The fieldnames parameter lists any additional fields in your index
> > > that need to be filled.  We were passing a category, a description
> > > for the document, a name, and the type, so you just need to specify
> > > the names of those fields.  Solr will then look for corresponding
> > > parameters with those names, which you can see at the end of my url.
> > > The values passed for the additional parameters need to be sent url
> > > encoded.
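> > >
> > > Putting that together with made-up values (url encoded), a full call
> > > might look like:
> > >
> > > /solr/update/rich?stream.file=/docs/contract.pdf&stream.type=pdf&id=103&stream.fieldname=data&fieldnames=cat,desc,type,name&cat=legal&desc=Signed%20contract&type=pdf&name=contract.pdf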
> > >
> > > I'm not a Java programmer, so if you have questions about the
> > > internals of the code, definitely direct those to Eric as I cannot
> > > help.  I have only implemented it in web applications.  If you have
> > > any other questions about the use of the patch, I can answer those.
> > >
> > > Enjoy!
> > >
> > > - Pete
> > >
> > > On 8/21/07, Vish D. <[EMAIL PROTECTED]> wrote:
> > > > There seems to be some code out for Tika now (not packaged/announced
> > > > yet, but...). Could someone please take a look at it and see if that
> > > > could fit in? I am eagerly waiting for a reply back from tika-dev,
> > > > but no luck yet.
> > > >
> > > > http://svn.apache.org/repos/asf/incubator/tika/trunk/src/main/java/org/apache/tika/
> > > >
> > > > I see that Eric's patch uses POI (for most of it)...so that's great!
> > > > I have seen too many duplicated efforts, even in Apache projects
> > > > alone, and this is one step closer to fixing it (other than Tika,
> > > > which isn't 'complete' yet). Are there any plans on releasing this
> > > > patch with the Solr dist? Or any instructions on using/installing
> > > > the patch itself?
> > > >
> > > > Thanks
> > > > Vish
> > > >
> > > >
> > > > On 8/21/07, Peter Manis <[EMAIL PROTECTED]> wrote:
> > > > >
> > > > > Christian,
> > > > >
> > > > > Eric Pugh implemented this functionality for a project we were
> > > > > doing and has released the code on JIRA.  We have had very good
> > > > > results with it.  If I can be of any help using it, beyond the
> > > > > Java code itself, let me know.  The last revision I used with it
> > > > > was 552853, so if the build happens to fail you can roll back to
> > > > > that and it will work.
> > > > >
> > > > > https://issues.apache.org/jira/browse/SOLR-284
> > > > >
> > > > > - Pete
> > > > >
> > > > > On 8/21/07, Christian Klinger <[EMAIL PROTECTED]> wrote:
> > > > > > Hi Solr Users,
> > > > > >
> > > > > > I have set up a Solr server with a custom schema.
> > > > > > Now I have updated the index with some content from
> > > > > > xml-files.
> > > > > >
> > > > > > Next I am trying to index the contents of a folder.
> > > > > > The folder consists of various document types
> > > > > > (pdf, doc, xls, ...).
> > > > > >
> > > > > > Is there a howto anywhere on how I can parse the
> > > > > > documents, make an xml of the parsed content,
> > > > > > and post it to the solr server?
> > > > > >
> > > > > > Thanks in advance.
> > > > > >
> > > > > > Christian
> > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
