RE: [HippoCMS-dev] Extractors and indexing

Ard Schrijvers Thu, 20 Dec 2007 05:04:03 -0800

Hello,

> Thanks for the fast reply and clarification.
> I would also like to ask about indexing without storing.
> We have a document like this:
> - title
> - newspaper
> - <url to pdf in /binaries>
> And now we need the ability to search for documents that have some 
> text in the pdf. I want to do this by writing my own extractor that 
> will take the pdf, extract the text and put it in a property. As you 
> said before, this is the way it should be done.


No, you really shouldn't do this. I think you are confused about the
indexing:

You do not need to put everything in a property to have it indexed!
Normally, I only extract a property for things I want to search on
specifically. For example, I have a <title> field in my xml and want to
be able to search on the title only, not on the entire xml, so I
extract the title as a property.

But not everything configured in the extractors is actually used to
set a property on a document. I do agree with you that it is a little
confusing:

1) Extractors with an instruction are used to extract a property, which
is set on the document and indexed according to the configuration in
dasl-indexer.xml for this property.
2) Extractors without an instruction do not put a property on a
document, but are only used during indexing!
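
A rough sketch of case 1), to make the difference visible. It uses the
instruction syntax of Slide's SimpleXmlExtractor as far as I remember
it; the class choice, the property name p_title, the namespace and the
xpath are all assumptions for illustration, so check them against the
extractor docs for your version:

```xml
<!-- Extractor WITH an instruction: sets a property on the document.
     p_title, the namespace and the xpath are hypothetical examples. -->
<extractor classname="org.apache.slide.extractor.SimpleXmlExtractor"
           uri="/files" content-type="text/xml">
  <configuration>
    <instruction property="p_title" namespace="DAV:"
                 xpath="/document/title/text()"/>
  </configuration>
</extractor>
```

Case 2) is the same element without the nested instruction.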

So, for example, if I configure:

<!-- XML content extractor -->
<extractor classname="nl.hippo.slide.extractor.XMLContentExtractor"
           uri="/files" content-type="text/xml"/>
  
it means that *all* xml content of xml docs under /files is indexed
(according to the global indexer in dasl-indexer.xml, default
nl.hippo.slide.index.analysis.SimpleStandardAnalyzer), without a
property being extracted.

Now, for your pdf / word etc. all you need to add is something like:

<extractor classname="org.apache.slide.extractor.MSWordExtractor"
           uri="/files/project.preview/binaries"
           content-type="application/msword"/>

or 

<extractor classname="org.apache.slide.extractor.PDFExtractor"
           uri="/files/project.preview/binaries"
           content-type="application/pdf"/>

where project is your realm/workspace; see [1].

Once you have added these extractors, stop the repository, delete the
lucene index, and restart; the lucene index will then be recreated.
Now, when searching/doing a dasl with

<d:contains>foo</d:contains>

you will get hits for all xml documents, but also pdf documents
containing 'foo'.
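
For completeness, a full request around that <d:contains> could look
roughly like the sketch below. It follows the standard DASL basicsearch
layout; the scope href and the selected property are assumptions, so
adjust them to your own realm:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<d:searchrequest xmlns:d="DAV:">
  <d:basicsearch>
    <d:select>
      <d:prop><d:displayname/></d:prop>
    </d:select>
    <d:from>
      <d:scope>
        <!-- assumed scope: covers both the xml docs and the binaries -->
        <d:href>/files/project.preview</d:href>
        <d:depth>infinity</d:depth>
      </d:scope>
    </d:from>
    <d:where>
      <!-- matches indexed content, no extracted property needed -->
      <d:contains>foo</d:contains>
    </d:where>
  </d:basicsearch>
</d:searchrequest>
```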

Hope things are a little more clear,

Regards Ard

[1]
http://www.hippocms.org/display/CMS/4.+Hippo+Repository+Configure+Extractors#4.HippoRepositoryConfigureExtractors-org.apache.slide.extractor.MSWordExtractor

> But in this case it's completely unnecessary to keep the pdf's text.
> Is there a way to avoid duplication?
> 
> Darek
> 
> 
> 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> >
> >
> > Hello Darek,
> >
> > > Hello,
> > > I was looking for this information in the docs and lists and found 
> > > nothing. If I am repeating a known problem, then sorry :)
> > >
> > > We have a problem with searching over documents. Let's say we have
> > > a document that consists of: title, date, abstract.
> > > We need the ability to search over these fields separately.
> > > We did that by making extractors that copy these fields to 
> > > properties p_title, p_date, p_abstract. Now lucene can index them
> > > and it works.
> > > But ...
> > > Now we have same content in 2 places.
> > > Is there a better way to do this?
> >
> > In principle, this is the way to do it. For a title and a date, it
> > is pretty normal and straightforward. For the abstract you might not
> > want to duplicate the entire text. For the abstract you might also
> > work with ConfigurableXMLContentExtractor [1]. Then in your
> > search/dasl, you could say something like:
> >
> > <d:contains locale="abstract"> your query </d:contains>
> >
> > As 'locale' already indicates, it is actually implemented for 
> > different languages within one xml file, so you would misuse it a
> > little.
> >
> > OTOH, you might just keep working with your current approach without
> > real problems. Make sure that, for the abstract, you configure the 
> > property in dasl-indexer.xml to be of type="text" (and use 
> > property-contains in your dasl instead of propcontains, see [2]).
> > For date and title you might choose not to do this.
> >
> > -Ard
> >
> > [1]
> > http://www.hippocms.org/display/CMS/Hippo+Repository+ConfigurableXMLContentExtractor
> > [2] http://www.hippocms.org/display/CMS/06.+Using+DASL+Queries
> >
> > >
> > > Second question.
> > > Is it possible to index (for searching) something without storing 
> > > its content? Just like in lucene:
> > > Field.Index = true
> > > Field.Store = false
> > >
> > > Regards,
> > > Darek
> > > ********************************************
> > > Hippocms-dev: Hippo CMS development public mailinglist
