RE: [HippoCMS-dev] Extractors and indexing

Ard Schrijvers Thu, 20 Dec 2007 07:50:06 -0800

Hello,

> 
> Unfortunately the problem is more complicated. As I said before,
> we have a document (cms type):
> - title (simple text input)
> - newspaper (simple text input)
> - <url to pdf in /binaries>
> So stored xml in repository looks like this (type='media'):
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>   <title>aa</title>
>   <newspaper>bb</newspaper>
>   <media>/binaries/en/acceptance-tdd.pdf</media>
>   <date>2007-12-06T00:00:00.000Z</date>
> </root>
> 
> Now, in the portal I would like to be able to sort and search in the
> media type by title, newspaper and date (we can do that by creating
> properties from these fields). The result would be a list of the
> media documents found.
> 
> But I would also like to search in media/media. So when someone
> types 'tdd' and hits the 'search in pdf' button, I can give him a
> list of media documents whose pdf contains the phrase 'tdd'. The
> result has to be the same: a list of the media documents found. In
> this case:
> title | newspaper | getPdf
> aa    | bb        | link(getPdf)
> 
> So the problem is not how to index the pdf file, but how to
> connect its content with the media type.


Yes, I understand your problem. IMO you have two options. The first
requires you to build something; the second needs some frontend and
indexing tricks:

1) You can create your own XMLContentExtractor that, when it encounters
a link to a binary, imports that binary and indexes it along with the
document content. This is feasible and not really hard, but not trivial
either. OTOH, you end up in a situation where, when the pdf gets
deleted or changed, the indexed document (which indexed the pdf as
well) won't be aware of it. You could add some logic that checks the
repository for all docs having a link to a pdf, so that when a pdf
changes, all docs using that pdf are re-indexed. But that needs some
engineering.

2) This is the easiest way, but it requires more frontend work (and has
a big disadvantage: it needs two searches). What you need to do is:

- extract all binary links with a
nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor (this extracts
all links, comma-separated, into one property)

- add this property to dasl-indexer.xml, and configure it with
type="text" and the analyzer LowercaseCommaSeparatedAnalyzer, for
example:
<property
analyzer="nl.hippo.slide.index.analysis.LowercaseCommaSeparatedAnalyzer"
name="references" namespace="http://hippo.nl/cms/1.0" type="text"/>

- if you now do a search in pdf, you get some results (repository
locations in href attr)

- with these results, you construct a new dasl that says: give me all
documents having one or more links to one of these results. The dasl
that does that is shown below (the target is just
/files/project.preview / live):

<d:where>
  <d:or>
    <d:propcontains>
      <d:prop>references</d:prop>
      <d:literal>firstfound-binary-link</d:literal>
    </d:propcontains>
    <d:propcontains>
      <d:prop>references</d:prop>
      <d:literal>secondfound-binary-link</d:literal>
    </d:propcontains>
    <d:propcontains>
      <d:prop>references</d:prop>
      <d:literal>thirdfound-binary-link</d:literal>
    </d:propcontains>
    <!-- etc. etc. -->
  </d:or>
</d:where>
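
To illustrate the second step, here is a rough sketch of how a frontend
could generate that where clause from the results of the first (pdf)
search. It is written in Python purely as an illustration and is not
part of Hippo or Slide: the function name build_where and the second
example link are made up, and whether the literals must be full
repository locations or lowercased to match the analyzer is something
you would have to check against your own index.

```python
from xml.sax.saxutils import escape


def build_where(binary_links, prop="references"):
    """Build a <d:where> fragment matching docs that reference any of the links.

    `binary_links` is assumed to be the list of repository locations
    returned by the first (pdf) search. Hypothetical helper, not Hippo API.
    """
    if not binary_links:
        raise ValueError("need at least one binary link")
    clauses = []
    for link in binary_links:
        # One <d:propcontains> per binary found, all combined under <d:or>.
        clauses.append(
            "    <d:propcontains>\n"
            f"      <d:prop>{escape(prop)}</d:prop>\n"
            f"      <d:literal>{escape(link)}</d:literal>\n"
            "    </d:propcontains>\n"
        )
    return "<d:where>\n  <d:or>\n" + "".join(clauses) + "  </d:or>\n</d:where>"


# Example with one link from this thread and one made-up link.
print(build_where(["/binaries/en/acceptance-tdd.pdf", "/binaries/en/other.pdf"]))
```

The literals are XML-escaped because repository paths may contain
characters such as '&'. If the first search returns only one binary,
you may prefer to emit the single propcontains without the d:or
wrapper.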

Obviously, solution 1 is nicer in the long run (and if you are able to
build it, which depends on how much time you have, you certainly may
send a patch :-)), but it is also much harder!
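
To give an idea of the extra bookkeeping solution 1 needs, the
"re-index all docs that link to a changed pdf" logic boils down to a
reverse lookup. The sketch below is plain Python and only an
illustration: the references mapping, the function names and the
document paths are hypothetical, and a real implementation would have
to read the links from the repository and trigger the indexer itself.

```python
from collections import defaultdict


def build_reverse_index(references):
    """Map each binary path to the set of documents that link to it.

    `references` is a hypothetical dict: document path -> list of binary paths.
    """
    reverse = defaultdict(set)
    for doc, binaries in references.items():
        for binary in binaries:
            reverse[binary].add(doc)
    return reverse


def docs_to_reindex(reverse, changed_binary):
    """Return the documents to re-index when the given binary changes."""
    return sorted(reverse.get(changed_binary, ()))


# Made-up document paths; only the pdf path comes from this thread.
references = {
    "/content/media/aa.xml": ["/binaries/en/acceptance-tdd.pdf"],
    "/content/media/cc.xml": [
        "/binaries/en/acceptance-tdd.pdf",
        "/binaries/en/other.pdf",
    ],
}
reverse = build_reverse_index(references)
print(docs_to_reindex(reverse, "/binaries/en/acceptance-tdd.pdf"))
```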

Hope this helps,

Regards Ard 

ps 

3) Option three, by the way, is what you suggested earlier: putting the
string contents of the pdf as a property on the document. But as you
indicated, that means quite some duplication, and I would discourage
adding such 'large' properties to documents (depending on your db
settings; I am not sure whether there are limits).

> 
> 
> Again, thank you for the answers,
> Darek
> 
> 
> 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> >
> >
> > Hello,
> >
> > > Thanks for the fast reply and clarification. I would also like to
> > > ask about indexing without storing.
> > > We have document like this:
> > > - title
> > > - newspaper
> > > - <url to pdf in /binaries>
> > > And now we need the ability to search for documents that have
> > > some text in the pdf. I want to do this by writing my own
> > > extractor that takes the pdf, extracts the text and puts it in a
> > > property. As you said before, this is the way it should be done.
> >
> > No, you really shouldn't do this. I think you are confused about the
> > indexing:
> >
> > You do not need to put everything in a property to have it indexed!!
> > Normally, I put things in an index that I specifically want to
> > search on (for example, I have a <title> field in my xml, but want
> > to be able to search on title only, and not on the entire xml. Then
> > I extract title as a property.)
> >
> > But not everything configured in extractors will actually be used
> > to set a property on a document. I do agree with you that it is a
> > little confusing:
> >
> > 1) Extractors with an instruction are used to extract a property,
> > which is set on the document and indexed according to the
> > configuration in dasl-indexer.xml for this property
> > 2) Extractors without an instruction do not put a property on a 
> > document, but are only used during indexing!
> >
> > So, for example, if I configure:
> >
> > <!-- XML content extractor -->
> > <extractor classname="nl.hippo.slide.extractor.XMLContentExtractor"
> > uri="/files" content-type="text/xml"/>
> >
> > it means that *all* xml content of xml docs under /files is indexed
> > (according to the global indexer in dasl-indexer.xml, by default
> > nl.hippo.slide.index.analysis.SimpleStandardAnalyzer), without a
> > property being extracted.
> >
> > Now, for your pdf / word etc. all you need to add is something like:
> >
> > <extractor classname="org.apache.slide.extractor.MSWordExtractor"
> >            uri="/files/project.preview/binaries"
> >            content-type="application/msword"/>
> >
> > or
> >
> > <extractor classname="org.apache.slide.extractor.PDFExtractor"
> >            uri="/files/project.preview/binaries"
> >            content-type="application/pdf"/>
> >
> > where project is your realm/workspace, see [1].
> >
> > If you add these extractors, stop the repository, delete the lucene
> > index and restart, the lucene index will be recreated. Now, when
> > searching/doing a dasl with
> >
> > <d:contains>foo</d:contains>
> >
> > you will get hits for all xml documents, but also pdf documents 
> > containing 'foo'.
> >
> > Hope things are a little more clear,
> >
> > Regards Ard
> >
> > [1]
> > http://www.hippocms.org/display/CMS/4.+Hippo+Repository+Configure+Extractors#4.HippoRepositoryConfigureExtractors-org.apache.slide.extractor.MSWordExtractor
> >
> > > But in this case it's completely unnecessary to keep the pdf's text.
> > > Is there a way to avoid duplication?
> > >
> > > Darek
> > >
> > >
> > > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > > >
> > > >
> > > > Hello Darek,
> > > >
> > > > > Hello,
> > > > > I was looking for this information in the docs and lists and
> > > > > found nothing. If I am repeating a problem, then sorry :)
> > > > >
> > > > > We have a problem with searching over documents. Let's say we
> > > > > have a document that consists of: title, date, abstract.
> > > > > We need the ability to search over these fields separately.
> > > > > We did that by making extractors that rewrite these fields to
> > > > > properties p_title, p_date, p_abstract. Now lucene can index
> > > > > them and it works.
> > > > > But ...
> > > > > Now we have the same content in 2 places.
> > > > > Is there a better way to do this?
> > > >
> > > > In principle, this is the way to do it. For a title and a date,
> > > > it is pretty normal and straightforward. For the abstract you
> > > > might not want to duplicate the entire text. For the abstract
> > > > you might also work with ConfigurableXMLContentExtractor [1].
> > > > Then in your search/dasl, you could say something like:
> > > >
> > > > <d:contains locale="abstract"> your query </d:contains>
> > > >
> > > > As 'locale' already indicates, it is actually implemented for
> > > > different languages within one xml file, so you would misuse it
> > > > a little.
> > > >
> > > > OTOH, you might just keep working with your current approach
> > > > without real problems. Make sure that, for the abstract, you
> > > > configure the property in dasl-indexer.xml to be of type="text"
> > > > (and use property-contains in your dasl instead of propcontains,
> > > > see [2]). For date and title you might choose not to do this.
> > > >
> > > > -Ard
> > > >
> > > > [1]
> > > > http://www.hippocms.org/display/CMS/Hippo+Repository+ConfigurableXMLContentExtractor
> > > > [2] http://www.hippocms.org/display/CMS/06.+Using+DASL+Queries
> > > >
> > > > >
> > > > > Second question.
> > > > > Is it possible to index (for searching) something without 
> > > > > storing its content? Just like in lucene:
> > > > > Field.Index = true
> > > > > Field.Store = false
> > > > >
> > > > > Regards,
> > > > > Darek
> > > > > ********************************************
> > > > > Hippocms-dev: Hippo CMS development public mailinglist
