Re: [HippoCMS-dev] Extractors and indexing

Darek Thu, 20 Dec 2007 08:17:36 -0800

Thanks again for your answer. Yes - it helps me a lot.
Right now I was just playing with something like solution 2, but I must say
I like the first one. I have to give it a try, and we'll see what comes up :)



Regards,
Darek



2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
>
> Hello,
>
> >
> > Unfortunately the problem is more complicated. As I said before,
> > we have a document (CMS type):
> > - title (simple text input)
> > - newspaper (simple text input)
> > - <url to pdf in /binaries>
> > So the XML stored in the repository looks like this (type='media'):
> > <?xml version="1.0" encoding="UTF-8"?>
> > <root>
> >   <title>aa</title>
> >   <newspaper>bb</newspaper>
> >   <media>/binaries/en/acceptance-tdd.pdf</media>
> >   <date>2007-12-06T00:00:00.000Z</date>
> > </root>
> >
> > Now, in the portal I would like to be able to sort and search in the
> > media type by title, newspaper, and date (we can do that by creating
> > properties from these fields). The result would be a list of the
> > documents of type media that were found.
> >
> > But I would also like to search in media/media. So when someone types
> > 'tdd' and hits the "search in pdf" button, I can give him a list of the
> > media documents that link to a PDF containing the phrase 'tdd'. The
> > result has to be the same: a list of the media documents found. In
> > this case:
> > title | newspaper | getPdf
> > aa    | bb        | link(getPdf)
> >
> > So the problem is not how to index the PDF file but how to connect
> > its content with the media type.
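
To make the data model above concrete, here is a minimal Python sketch (not Hippo or Slide code) that pulls the searchable fields and the binary link out of the sample document. The function name `extract_properties` is a hypothetical helper, and treating every <media> element as a binary link is an assumption; in the repository this job is done by the configured extractors.

```python
import xml.etree.ElementTree as ET

# The sample document from this thread, kept as bytes so the parser
# handles the encoding declaration itself.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <title>aa</title>
  <newspaper>bb</newspaper>
  <media>/binaries/en/acceptance-tdd.pdf</media>
  <date>2007-12-06T00:00:00.000Z</date>
</root>"""


def extract_properties(xml_bytes):
    """Return the fields to search/sort on, plus the binary links.

    Hypothetical helper: links are joined with commas into a single
    'references' value, mirroring a multi-value property.
    """
    root = ET.fromstring(xml_bytes)
    # Assumption: every <media> element holds one link to a binary.
    links = [m.text.strip() for m in root.iter("media")
             if m.text and m.text.strip()]
    return {
        "title": root.findtext("title", default=""),
        "newspaper": root.findtext("newspaper", default=""),
        "date": root.findtext("date", default=""),
        "references": ",".join(links),
    }


if __name__ == "__main__":
    print(extract_properties(SAMPLE))
```

The title, newspaper and date values are what you would sort and search on directly; the references value is the only thing that ties the document to the PDF.
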
>
> Yes, I understand your problem. IMO, you have two options: for one you
> have to build something, for the other you have to do some frontend and
> indexing tricks:
>
> 1) You can create your own XMLContentExtractor, and when it encounters a
> link to a binary, import that binary and index it along with the
> document content. This is feasible and not really hard, but also not
> trivial. OTOH, you will end up in a situation where, when the PDF gets
> deleted or changed, the indexed document (which indexed the PDF as well)
> won't be aware of it. You might then add some logic that checks the
> repository for all docs having a link to some PDF, so that when a PDF is
> changed, all docs using that PDF are re-indexed. But you need some
> engineering for it.
>
> 2) This is the easiest way but requires more frontend work (and has a big
> disadvantage: it needs two searches). What you need to do is:
>
> - extract all binary links with a
> nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor (this extracts
> all links, comma-separated, into one property)
>
> - add to the dasl-indexer.xml this property, and configure it to be
> type="text" and analyzer LowercaseCommaSeparatedAnalyzer, for example:
> <property
> analyzer="nl.hippo.slide.index.analysis.LowercaseCommaSeparatedAnalyzer"
> name="references" namespace="http://hippo.nl/cms/1.0" type="text"/>
>
> - if you now do a search in pdf, you get some results (repository
> locations in href attr)
>
> - With these results, you construct a new dasl that says: give me all
> documents having one or more links to one of these results. The dasl
> that does that is (target is just /files/project.preview / live)
>
> <d:where>
>         <d:or>
>                 <d:propcontains>
>                         <d:prop>references</d:prop>
>                         <d:literal>firstfound-binary-link</d:literal>
>                 </d:propcontains>
>                 <d:propcontains>
>                         <d:prop>references</d:prop>
>                         <d:literal>secondfound-binary-link</d:literal>
>                 </d:propcontains>
>                 <d:propcontains>
>                         <d:prop>references</d:prop>
>                         <d:literal>thirdfound-binary-link</d:literal>
>                 </d:propcontains>
>                 etc etc
>         </d:or>
> </d:where>
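
The second search in solution 2 has to be generated from the results of the first one, so here is a rough Python sketch of that step. It only builds the <d:where> fragment quoted above; the function name `build_reference_where` is hypothetical, the surrounding search request is left out, and skipping the <d:or> wrapper for a single link is my own assumption, not something the thread specifies.

```python
from xml.sax.saxutils import escape


def build_reference_where(binary_links, prop="references"):
    """Build a <d:where> fragment matching documents that link to
    at least one of the given binaries (hypothetical helper)."""
    links = list(binary_links)
    if not links:
        raise ValueError("need at least one binary link")

    clauses = []
    for link in links:
        # escape() protects characters such as '&' and '<' in paths.
        clauses.append(
            "<d:propcontains>"
            f"<d:prop>{escape(prop)}</d:prop>"
            f"<d:literal>{escape(link)}</d:literal>"
            "</d:propcontains>"
        )

    if len(clauses) == 1:
        # Assumption: a one-child <d:or> adds nothing, so leave it out.
        return f"<d:where>{clauses[0]}</d:where>"
    return "<d:where><d:or>" + "".join(clauses) + "</d:or></d:where>"


if __name__ == "__main__":
    # Links as they might come back from the first (PDF) search.
    found = ["/binaries/en/acceptance-tdd.pdf", "/binaries/en/other.pdf"]
    print(build_reference_where(found))
```

Whether the literal has to be lowercased to match what LowercaseCommaSeparatedAnalyzer put in the index is not covered here; that would need checking against the repository.
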
>
> Obviously, solution 1 is nicer in the long run (and if you are able to
> build it, which depends on how much time you have, you certainly may
> send a patch :-) ), but it is also much harder!
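
The extra engineering for solution 1 is mostly bookkeeping: knowing which documents link to which PDF, so they can be re-indexed when the PDF changes. A minimal in-memory sketch of that bookkeeping in Python follows. The class `ReferenceIndex` and its method names are hypothetical, nothing here is Hippo or Slide API, and a real implementation would have to query or persist this in the repository.

```python
from collections import defaultdict


class ReferenceIndex:
    """Tracks which documents link to which binaries (hypothetical)."""

    def __init__(self):
        self._docs_by_binary = defaultdict(set)
        self._binaries_by_doc = {}

    def update_document(self, doc_path, binary_links):
        """Record the current binary links of a document."""
        # Forget the links this document had before the update.
        for old in self._binaries_by_doc.get(doc_path, ()):
            self._docs_by_binary[old].discard(doc_path)
        links = set(binary_links)
        self._binaries_by_doc[doc_path] = links
        for link in links:
            self._docs_by_binary[link].add(doc_path)

    def documents_to_reindex(self, binary_path):
        """Documents to re-index when this binary changes or is deleted."""
        return sorted(self._docs_by_binary.get(binary_path, ()))


if __name__ == "__main__":
    index = ReferenceIndex()
    # Document path below is made up for illustration.
    index.update_document("/content/media/aa.xml",
                          ["/binaries/en/acceptance-tdd.pdf"])
    print(index.documents_to_reindex("/binaries/en/acceptance-tdd.pdf"))
```
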
>
> Hope this helps,
>
> Regards Ard
>
> ps
>
> 3) Option three, by the way, is what you suggested previously: putting
> the string contents of the PDF as a property on the document. But as you
> indicated, it is quite some code duplication, and I would discourage you
> from adding such 'large' properties to documents (depending on your DB
> settings, I am not sure whether there are limits).
>
> >
> >
> > Again thank you for answers,
> > Darek
> >
> >
> > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > >
> > >
> > > Hello,
> > >
> > > > Thanks for the fast reply and clarification. I would also like to
> > > > ask about indexing without storing.
> > > > We have a document like this:
> > > > - title
> > > > - newspaper
> > > > - <url to pdf in /binaries>
> > > > And now we need the ability to search for documents that have some
> > > > text in the PDF. I want to do this by writing my own extractor that
> > > > will take the PDF, extract the text and put it in a property. As you
> > > > said before, this is the way it should be done.
> > >
> > > No, you really shouldn't do this. I think you are confused about the
> > > indexing:
> > >
> > > You do not need to put everything in a property to have it indexed!!
> > > Normally, I put in an index the things I want to search on
> > > specifically (for example, I have a <title> field in my XML but want
> > > to be able to search on the title only, and not on the entire XML;
> > > then I extract the title as a property).
> > >
> > > But not everything configured in the extractors will actually be used
> > > to set a property on a document. I do agree with you that it is a
> > > little confusing:
> > >
> > > 1) Extractors with an instruction are used to extract a property,
> > > which is set on the document and indexed according to the
> > > configuration in dasl-indexer.xml for this property.
> > > 2) Extractors without an instruction do not put a property on a
> > > document, but are only used during indexing!
> > >
> > > So, for example, if I configure:
> > >
> > > <!-- XML content extractor -->
> > > <extractor classname="nl.hippo.slide.extractor.XMLContentExtractor"
> > > uri="/files" content-type="text/xml"/>
> > >
> > > it means that *all* XML content of XML docs under /files is indexed
> > > (according to the global indexer in dasl-indexer.xml, by default
> > > nl.hippo.slide.index.analysis.SimpleStandardAnalyzer), without a
> > > property being extracted.
> > >
> > > Now, for your PDF / Word etc., all you need to add is something like:
> > >
> > > <extractor classname="org.apache.slide.extractor.MSWordExtractor"
> > >         uri="/files/project.preview/binaries"
> > >         content-type="application/msword"/>
> > >
> > > or
> > >
> > > <extractor classname="org.apache.slide.extractor.PDFExtractor"
> > >         uri="/files/project.preview/binaries"
> > >         content-type="application/pdf"/>
> > >
> > > where project is your realm/workspace; see [1].
> > >
> > > If you add these extractors, stop the repository, delete the Lucene
> > > index, and restart, the Lucene index will be recreated. Now, when
> > > searching/doing a dasl with
> > >
> > > <d:contains>foo</d:contains>
> > >
> > > you will get hits for all xml documents, but also pdf documents
> > > containing 'foo'.
> > >
> > > Hope things are a little more clear,
> > >
> > > Regards Ard
> > >
> > > [1]
> > > http://www.hippocms.org/display/CMS/4.+Hippo+Repository+Configure+Extractors#4.HippoRepositoryConfigureExtractors-org.apache.slide.extractor.MSWordExtractor
> > >
> > > > But in this case it is completely unnecessary to keep the PDF's text.
> > > > Is there a way to avoid duplication?
> > > >
> > > > Darek
> > > >
> > > >
> > > > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > > > >
> > > > >
> > > > > Hello Darek,
> > > > >
> > > > > > Hello,
> > > > > > I was looking for this information in the docs and lists and
> > > > > > found nothing. If I am repeating a known problem, then sorry :)
> > > > > >
> > > > > > We have a problem with searching over documents. Let's say we
> > > > > > have a document that consists of: title, date, abstract.
> > > > > > We need the ability to search over these fields separately.
> > > > > > We did that by making extractors that copy these fields to
> > > > > > properties p_title, p_date, p_abstract. Now Lucene can index
> > > > > > them and it works.
> > > > > > But ...
> > > > > > Now we have the same content in 2 places.
> > > > > > Is there a better way to do this?
> > > > >
> > > > > In principle, this is the way to do it. For a title and a date,
> > > > > it is pretty normal and straightforward. For the abstract you
> > > > > might not want to duplicate the entire text. For the abstract you
> > > > > might also work with ConfigurableXMLContentExtractor [1]. Then in
> > > > > your search/dasl, you could say something like:
> > > > >
> > > > > <d:contains locale="abstract"> your query </d:contains>
> > > > >
> > > > > As 'locale' already indicates, it is actually implemented for
> > > > > different languages within one XML file, so you would misuse it a
> > > > > little.
> > > > >
> > > > > OTOH, you might just keep working with your current approach
> > > > > without real problems. Make sure that, for the abstract, you
> > > > > configure the property in dasl-indexer.xml to be of type="text"
> > > > > (and use property-contains in your dasl instead of propcontains,
> > > > > see [2]). For date and title you might choose not to do this.
> > > > >
> > > > > -Ard
> > > > >
> > > > > [1]
> > > > > http://www.hippocms.org/display/CMS/Hippo+Repository+ConfigurableXMLContentExtractor
> > > > > [2] http://www.hippocms.org/display/CMS/06.+Using+DASL+Queries
> > > > >
> > > > > >
> > > > > > Second question.
> > > > > > Is it possible to index (for searching) something without
> > > > > > storing its content? Just like in lucene:
> > > > > > Field.Index = true
> > > > > > Field.Store = false
> > > > > >
> > > > > > Regards,
> > > > > > Darek
> > > > > > ********************************************
> > > > > > Hippocms-dev: Hippo CMS development public mailinglist
> > > > > >
> > > > > ********************************************
> > > > > Hippocms-dev: Hippo CMS development public mailinglist
> > > > >
> > > > ********************************************
> > > > Hippocms-dev: Hippo CMS development public mailinglist
> > > >
> > > ********************************************
> > > Hippocms-dev: Hippo CMS development public mailinglist
> > >
> > ********************************************
> > Hippocms-dev: Hippo CMS development public mailinglist
> >
> ********************************************
> Hippocms-dev: Hippo CMS development public mailinglist
>
********************************************
Hippocms-dev: Hippo CMS development public mailinglist
