Re: [HippoCMS-dev] Extractors and indexing

Darek Thu, 20 Dec 2007 07:30:12 -0800

Unfortunately, the problem is more complicated.
As I said before, we have a document (CMS type) with:
- title (simple text input)
- newspaper (simple text input)
- <url to pdf in /binaries>
So the XML stored in the repository looks like this (type='media'):
<?xml version="1.0" encoding="UTF-8"?>
<root>
  <title>aa</title>
  <newspaper>bb</newspaper>
  <media>/binaries/en/acceptance-tdd.pdf</media>
  <date>2007-12-06T00:00:00.000Z</date>
</root>


Now, in the portal I would like to be able to sort and search documents of
type 'media' by title, newspaper, and date (we can do that by creating
properties from these fields). The result would be a list of the matching
media documents.

But I would also like to search in media/media, that is, inside the linked
PDF. So when someone types 'tdd' and hits the 'search in pdf' button, I can
give them a list of the media documents whose PDF contains the phrase 'tdd'.
The result has to be the same: a list of the matching media documents. In
this case:
title | newspaper | getPdf
aa    | bb        | link(getPdf)

So the problem is not how to index the PDF file, but how to connect its
content with the media document that links to it.
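
A minimal sketch of the join described above, in plain Python rather than anything Hippo- or Slide-specific. It assumes the PDF text has already been extracted by some indexer, and it uses hypothetical in-memory dictionaries (`MEDIA_DOCS`, `PDF_TEXT`) and a hypothetical function name in place of the repository and the Lucene index; the document paths and the sample PDF text are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the repository: document path -> stored XML.
MEDIA_DOCS = {
    "/content/media/one.xml": (
        "<root><title>aa</title><newspaper>bb</newspaper>"
        "<media>/binaries/en/acceptance-tdd.pdf</media>"
        "<date>2007-12-06T00:00:00.000Z</date></root>"
    ),
    "/content/media/two.xml": (
        "<root><title>cc</title><newspaper>dd</newspaper>"
        "<media>/binaries/en/other.pdf</media>"
        "<date>2007-12-07T00:00:00.000Z</date></root>"
    ),
}

# Hypothetical stand-in for the full-text index: PDF path -> extracted text.
PDF_TEXT = {
    "/binaries/en/acceptance-tdd.pdf": "Acceptance TDD in practice",
    "/binaries/en/other.pdf": "Nothing relevant here",
}


def media_docs_matching_pdf_text(phrase, media_docs, pdf_text):
    """Return media documents whose linked PDF contains the phrase."""
    needle = phrase.lower()
    # Step 1: find the PDFs that contain the phrase (case-insensitive).
    hit_pdfs = {path for path, text in pdf_text.items() if needle in text.lower()}
    # Step 2: map the hits back to the media documents that link to them.
    results = []
    for doc_path, xml_source in sorted(media_docs.items()):
        root = ET.fromstring(xml_source)
        pdf_path = root.findtext("media")
        if pdf_path in hit_pdfs:
            results.append({
                "doc": doc_path,
                "title": root.findtext("title"),
                "newspaper": root.findtext("newspaper"),
                "pdf": pdf_path,
            })
    return results


if __name__ == "__main__":
    for row in media_docs_matching_pdf_text("tdd", MEDIA_DOCS, PDF_TEXT):
        print(row["title"], "|", row["newspaper"], "|", row["pdf"])
```

The point of the sketch is only the two-step shape: a hit on the PDF has to be translated back into the media document that references it, because the result list must contain media documents, not PDFs.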


Again, thank you for the answers,
Darek


2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
>
>
> Hello,
>
> > Thanks for the fast reply and clarification.
> > I would also like to ask about indexing without storing.
> > We have a document like this:
> > - title
> > - newspaper
> > - <url to pdf in /binaries>
> > And now we need the ability to search for documents that have some
> > text in the pdf. I want to do this by writing my own extractor that
> > will take the pdf, extract the text and put it in a property. As you
> > said before, this is the way it should be done.
>
> No, you really shouldn't do this. I think you are confused about the
> indexing:
>
> You do not need to put everything in a property to have it indexed!!
> Normally, I put things in an index I want to specifically search on
> (like, I have a <title> field in my xml, but want to be able to search
> on title only, and not on the entire xml. Then I extract title as a
> property.)
>
> But, not everything configured in extractors will actually be used to
> set a property on a document. I do agree with you that it is a little
> confusing:
>
> 1) Extractors with an instruction are used to extract a property, which
> is set on the document, and indexed according to the configuration in
> dasl-indexer.xml for this property
> 2) Extractors without an instruction do not put a property on a
> document, but are only used during indexing!
>
> So, for example, if I configure:
>
> <!-- XML content extractor -->
> <extractor classname="nl.hippo.slide.extractor.XMLContentExtractor"
> uri="/files" content-type="text/xml"/>
>
> it means that *all* xml content of xml docs under /files is indexed
> (according to the global indexer in dasl-indexer.xml, default
> nl.hippo.slide.index.analysis.SimpleStandardAnalyzer), without a
> property being extracted.
>
> Now, for your pdf / word etc., all you need to add is something like:
>
> <extractor classname="org.apache.slide.extractor.MSWordExtractor"
>            uri="/files/project.preview/binaries"
>            content-type="application/msword"/>
>
> or
>
> <extractor classname="org.apache.slide.extractor.PDFExtractor"
>            uri="/files/project.preview/binaries"
>            content-type="application/pdf"/>
>
> where project is your realm/workspace; see [1].
>
> If you add these extractors, stop the repository, delete the lucene
> index, and restart, the lucene index will be recreated. Now, when
> searching/doing a dasl with
>
> <d:contains>foo</d:contains>
>
> you will get hits for all xml documents, but also pdf documents
> containing 'foo'.
>
> Hope things are a little more clear,
>
> Regards Ard
>
> [1] http://www.hippocms.org/display/CMS/4.+Hippo+Repository+Configure+Extractors#4.HippoRepositoryConfigureExtractors-org.apache.slide.extractor.MSWordExtractor
>
> > But in this case it's completely unnecessary to keep the pdf's text.
> > Is there a way to avoid duplication?
> >
> > Darek
> >
> >
> > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > >
> > >
> > > Hello Darek,
> > >
> > > > Hello,
> > > > I was looking for this information in the docs and lists and found
> > > > nothing. If I am repeating a known problem, then sorry :)
> > > >
> > > > We have a problem with searching over documents. Let's say we have a
> > > > document that consists of: title, date, abstract.
> > > > We need the ability to search over these fields separately.
> > > > We did that by making extractors that copy these fields to
> > > > properties p_title, p_date, p_abstract. Now lucene can index them and
> > > > it works.
> > > > But ...
> > > > Now we have the same content in 2 places.
> > > > Is there a better way to do this?
> > >
> > > In principle, this is the way to do it. For a title and a date, it is
> > > pretty normal and straightforward. For the abstract you might not want
> > > to duplicate the entire text. For the abstract you might also work
> > > with ConfigurableXMLContentExtractor [1]. Then in your search/dasl,
> > > you could say something like:
> > >
> > > <d:contains locale="abstract"> your query </d:contains>
> > >
> > > As 'locale' already indicates, it is actually implemented for
> > > different languages within one xml file, so you would misuse it a
> > > little.
> > >
> > > OTOH, you might just keep working with your current approach without
> > > real problems. Make sure that, for the abstract, you configure the
> > > property in dasl-indexer.xml to be of type="text" (and use
> > > property-contains in your dasl instead of propcontains, see [2]). For
> > > date and title you might choose not to do this.
> > >
> > > -Ard
> > >
> > > [1] http://www.hippocms.org/display/CMS/Hippo+Repository+ConfigurableXMLContentExtractor
> > > [2] http://www.hippocms.org/display/CMS/06.+Using+DASL+Queries
> > >
> > > >
> > > > Second question.
> > > > Is it possible to index (for searching) something without storing
> > > > its content? Just like in lucene:
> > > > Field.Index = true
> > > > Field.Store = false
> > > >
> > > > Regards,
> > > > Darek
> > > > ********************************************
> > > > Hippocms-dev: Hippo CMS development public mailinglist
