Unfortunately, the problem is more complicated.
As I said before, we have a document (cms type) with:
- title (simple text input)
- newspaper (simple text input)
- <url to pdf in /binaries>

So the xml stored in the repository looks like this (type='media'):

<?xml version="1.0" encoding="UTF-8"?>
<root>
  <title>aa</title>
  <newspaper>bb</newspaper>
  <media>/binaries/en/acceptance-tdd.pdf</media>
  <date>2007-12-06T00:00:00.000Z</date>
</root>
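
For illustration only, here is a minimal Python sketch of reading the fields out of such a stored document, for example to decide which of them should become properties. The helper name parse_media_doc and the dict it returns are hypothetical; this is not Hippo CMS or Slide code, and it assumes the date always has exactly the format shown above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# The sample document from this mail, kept as bytes so that the
# encoding declaration in the XML prolog is honoured by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<root>
  <title>aa</title>
  <newspaper>bb</newspaper>
  <media>/binaries/en/acceptance-tdd.pdf</media>
  <date>2007-12-06T00:00:00.000Z</date>
</root>"""


def parse_media_doc(xml_bytes):
    """Return the fields of a 'media' document as a dict (hypothetical helper)."""
    root = ET.fromstring(xml_bytes)
    return {
        "title": root.findtext("title", default=""),
        "newspaper": root.findtext("newspaper", default=""),
        # Path of the pdf under /binaries: the only link to the binary.
        "media": root.findtext("media", default=""),
        # Assumption: the date is always UTC with millisecond precision.
        "date": datetime.strptime(root.findtext("date"), "%Y-%m-%dT%H:%M:%S.%fZ"),
    }


if __name__ == "__main__":
    print(parse_media_doc(SAMPLE))
```

Note that the <media> value is just a path, so any hit on the pdf content still has to be matched back to the document through it.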

Now, in the portal I would like to be able to sort and search media
documents by title, newspaper and date (we can do that by creating
properties from these fields). The result would be a list of the media
documents found.

But I would also like to search in media/media. So when someone types
'tdd' and hits the "search in pdf" button, I can give him a list of the
media documents whose pdf contains the phrase 'tdd'. The result has to
be the same: a list of the media documents found. In this case:

title | newspaper | getPdf
aa    | bb        | link(getPdf)

So the problem is not how to index the pdf file, but how to connect its
content with the media document.

Again, thank you for the answers,
Darek

2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
>
>
> Hello,
>
> > Thanks for the fast reply and clarification,
> > I also would like to ask about indexing without storing.
> > We have document like this:
> > - title
> > - newspaper
> > - <url to pdf in /binaries>
> > And now we need ability to search documents that have some
> > text in pdf. I want to do this by writing my extractor that
> > will take pdf, extract text and put it in property. As you
> > said before this is the way it should be done.
>
> No, you really shouldn't do this. I think you are confused about the
> indexing:
>
> You do not need to put everything in a property to have it indexed!!
> Normally, I put things in an index I want to specifically search on
> (like, I have a <title> field in my xml, but want to be able to search
> on title only, and not on the entire xml. Then I extract title as a
> property.)
>
> But, not everything configured in extractors will actually be used to
> set a property on a document. I do agree with you that it is a little
> confusing:
>
> 1) Extractors with an instruction are used to extract a property, which
> is set on the document, and indexed according to the configuration in
> dasl-indexer.xml for this property
> 2) Extractors without an instruction do not put a property on a
> document, but are only used during indexing!
>
> So, for example, if I configure:
>
> <!-- XML content extractor -->
> <extractor classname="nl.hippo.slide.extractor.XMLContentExtractor"
> uri="/files" content-type="text/xml"/>
>
> it means that *all* xml content of xml docs under /files is indexed
> (according to the global indexer in dasl-indexer.xml, default
> nl.hippo.slide.index.analysis.SimpleStandardAnalyzer), without a
> property being extracted.
>
> Now, for your pdf / word etc all you need to add is something like:
>
> <extractor classname="org.apache.slide.extractor.MSWordExtractor"
> uri="/files/project.preview/binaries"
> content-type="application/msword"/>
>
> or
>
> <extractor classname="org.apache.slide.extractor.PDFExtractor"
> uri="/files/project.preview/binaries"
> content-type="application/pdf"/>
>
> where project is your realm/workspace, see [1].
>
> If you add these extractors, stop the repository, delete the lucene
> index, and restart, the lucene index will be recreated. Now, when
> searching/doing a dasl with
>
> <d:contains>foo</d:contains>
>
> you will get hits for all xml documents, but also pdf documents
> containing 'foo'.
>
> Hope things are a little more clear,
>
> Regards Ard
>
> [1]
> http://www.hippocms.org/display/CMS/4.+Hippo+Repository+Configure+Extractors#4.HippoRepositoryConfigureExtractors-org.apache.slide.extractor.MSWordExtractor
>
> > But in this case it's completely unnecessary to keep pdf's text.
> > Is there a way to avoid duplication?
> >
> > Darek
> >
> >
> > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > >
> > >
> > > Hello Darek,
> > >
> > > > Hello,
> > > > I was looking for this information in docs, lists and found
> > > > nothing. If I repeated a problem - then sorry :)
> > > >
> > > > We have a problem with searching over documents. Let's say we have a
> > > > document that consists of : title, date, abstract.
> > > > We need ability to search over these fields separately.
> > > > We did that by making extractors that rewrite these fields to
> > > > properties p_title, p_date, p_abstract. Now lucene can index it and
> > > > it works.
> > > > But ...
> > > > Now we have the same content in 2 places.
> > > > Is there a better way to do this?
> > >
> > > In principle, this is the way to do it. For a title and a date, it is
> > > pretty normal and straightforward. For the abstract you might not want
> > > to duplicate the entire text. For the abstract you might also work
> > > with ConfigurableXMLContentExtractor [1]. Then in your search/dasl,
> > > you could say something like:
> > >
> > > <d:contains locale="abstract"> your query </>
> > >
> > > As 'locale' already indicates, it is actually implemented for
> > > different languages within one xml file, so you would misuse it a
> > > little.
> > >
> > > OTOH, you might just keep working with your current approach without
> > > real problems. Make sure, that for the abstract, you configure the
> > > property in dasl-indexer.xml to be of type="text" (and use
> > > property-contains in your dasl instead of propcontains, see [2]). For
> > > date and title you might want to choose to not do this
> > >
> > > -Ard
> > >
> > > [1] http://www.hippocms.org/display/CMS/Hippo+Repository+ConfigurableXMLContentExtractor
> > > [2] http://www.hippocms.org/display/CMS/06.+Using+DASL+Queries
> > >
> > > > Second question.
> > > > Is it possible to index (for searching) something without storing
> > > > its content?
> > > > Just like in lucene:
> > > > Field.Index = true
> > > > Field.Store = false
> > > >
> > > > Regards,
> > > > Darek
> > > > ********************************************
> > > > Hippocms-dev: Hippo CMS development public mailinglist
