Hello,

> Unfortunately, the problem is more complicated. As I said before,
> we have a document (cms type):
> - title (simple text input)
> - newspaper (simple text input)
> - <url to pdf in /binaries>
> So the stored xml in the repository looks like this (type='media'):
> <?xml version="1.0" encoding="UTF-8"?>
> <root>
>   <title>aa</title>
>   <newspaper>bb</newspaper>
>   <media>/binaries/en/acceptance-tdd.pdf</media>
>   <date>2007-12-06T00:00:00.000Z</date>
> </root>
>
> Now, in the portal I would like to be able to sort and search
> media types by title, newspaper and date (we can do that by
> creating properties from these fields). The result would be a
> list of the media types found.
>
> But I would also like to search in media/media. So when someone
> types 'tdd' and hits the "search in pdf" button, I can give him
> a list of the media types that contain a pdf that contains the
> phrase 'tdd'. The result has to be the same: a list of the
> media types found. In this case:
> title | newspaper | getPdf
> aa    | bb        | link(getPdf)
>
> So the problem is not how to index the pdf file but how to
> connect its content with the media type.

Yes, I understand your problem. IMO, you have two options: for the
first you have to build something, for the other you have to do some
frontend and indexing tricks:

1) You can create your own XMLContentExtractor, and when it encounters
a link to a binary, import that binary and index it along with the
document content. This is feasible and not really hard, but also not
trivial. OTOH, you will end up in a situation where, when the pdf gets
deleted/changed, the indexed document (which indexed the pdf as well)
won't be aware of it. You might again add some logic that checks the
repository for all docs having a link to some pdf, so that when a pdf
is changed, all docs using that pdf are re-indexed. But you need some
engineering for that.

2) This is the easiest way, but it requires more frontend work (and
has a big disadvantage: it needs two searches). What you need to do is:

- extract all binary links with a
  nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor (this
  extracts all links, comma-separated, into one property)
- add this property to dasl-indexer.xml, and configure it with
  type="text" and the analyzer LowercaseCommaSeparatedAnalyzer, for
  example:

  <property
    analyzer="nl.hippo.slide.index.analysis.LowercaseCommaSeparatedAnalyzer"
    name="references" namespace="http://hippo.nl/cms/1.0" type="text"/>

- if you now do a search in pdf, you get some results (repository
  locations in the href attr)
- with these results, you construct a new dasl that says: give me all
  documents having one or more links to one of these results.

The dasl that does that is (target is just /files/project.preview /
live):

<d:where>
  <d:or>
    <d:propcontains>
      <d:prop>references</d:prop>
      <d:literal>firstfound-binary-link</d:literal>
    </d:propcontains>
    <d:propcontains>
      <d:prop>references</d:prop>
      <d:literal>secondfound-binary-link</d:literal>
    </d:propcontains>
    <d:propcontains>
      <d:prop>references</d:prop>
      <d:literal>thirdfound-binary-link</d:literal>
    </d:propcontains>
    etc etc
  </d:or>
</d:where>

Obviously, solution 1 is nicer in the long run (and if you are capable
of building it, which depends on how much time you have, you certainly
may send a patch :-)), but it is also much harder!

Hope this helps,

Regards Ard

ps 3) Option three is, by the way, your previously suggested approach
of putting the string contents of the pdf as a property on the
document. But as you indicated, it means quite some code duplication,
and I would discourage you from adding such 'large' properties to
documents (depending on your db settings; I am not sure whether there
are limits).

>
> Again thank you for answers,
> Darek
>
>
> 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> >
> > Hello,
> >
> > > Thanks for the fast reply and clarification. I would also like to
> > > ask about indexing without storing.
> > > We have a document like this:
> > > - title
> > > - newspaper
> > > - <url to pdf in /binaries>
> > > And now we need the ability to search for documents that have some
> > > text in the pdf. I want to do this by writing my own extractor that
> > > will take the pdf, extract the text and put it in a property. As you
> > > said before, this is the way it should be done.
> >
> > No, you really shouldn't do this. I think you are confused about the
> > indexing:
> >
> > You do not need to put everything in a property to have it indexed!!
> > Normally, I put things in an index I want to specifically search on
> > (like, I have a <title> field in my xml, but want to be able to search
> > on title only, and not on the entire xml. Then I extract title as a
> > property.)
> >
> > But not everything configured in extractors will actually be used to
> > set a property on a document. I do agree with you that it is a little
> > confusing:
> >
> > 1) Extractors with an instruction are used to extract a property,
> > which is set on the document and indexed according to the
> > configuration in dasl-indexer.xml for this property
> > 2) Extractors without an instruction do not put a property on a
> > document, but are only used during indexing!
> >
> > So, for example, if I configure:
> >
> > <!-- XML content extractor -->
> > <extractor classname="nl.hippo.slide.extractor.XMLContentExtractor"
> >   uri="/files" content-type="text/xml"/>
> >
> > it means that *all* xml content of xml docs under /files is indexed
> > (according to the global indexer in dasl-indexer.xml, by default
> > nl.hippo.slide.index.analysis.SimpleStandardAnalyzer), without a
> > property being extracted.
> >
> > Now, for your pdf / word etc. all you need to add is something like:
> >
> > <extractor classname="org.apache.slide.extractor.MSWordExtractor"
> >   uri="/files/project.preview/binaries"
> >   content-type="application/msword"/>
> >
> > or
> >
> > <extractor classname="org.apache.slide.extractor.PDFExtractor"
> >   uri="/files/project.preview/binaries"
> >   content-type="application/pdf"/>
> >
> > where project is your realm/workspace, see [1].
> >
> > If you add these extractors, stop the repository, delete the lucene
> > index, and restart, the lucene index will be recreated. Now, when
> > searching/doing a dasl with
> >
> > <d:contains>foo</d:contains>
> >
> > you will get hits for all xml documents, but also pdf documents
> > containing 'foo'.
> >
> > Hope things are a little more clear,
> >
> > Regards Ard
> >
> > [1]
> > http://www.hippocms.org/display/CMS/4.+Hippo+Repository+Configure+Extractors#4.HippoRepositoryConfigureExtractors-org.apache.slide.extractor.MSWordExtractor
> >
> > > But in this case it's completely unnecessary to keep the pdf's text.
> > > Is there a way to avoid duplication?
> > >
> > > Darek
> > >
> > >
> > > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > > >
> > > > Hello Darek,
> > > >
> > > > > Hello,
> > > > > I was looking for this information in the docs and lists and
> > > > > found nothing. If I am repeating a problem, then sorry :)
> > > > >
> > > > > We have a problem with searching over documents. Let's say we
> > > > > have a document that consists of: title, date, abstract.
> > > > > We need the ability to search over these fields separately.
> > > > > We did that by making extractors that rewrite these fields to
> > > > > properties p_title, p_date, p_abstract. Now lucene can index it
> > > > > and it works.
> > > > > But ...
> > > > > Now we have the same content in 2 places.
> > > > > Is there a better way to do this?
> > > >
> > > > In principle, this is the way to do it. For a title and a date, it
> > > > is pretty normal and straightforward. For the abstract you might
> > > > not want to duplicate the entire text. For the abstract you might
> > > > also work with ConfigurableXMLContentExtractor [1]. Then in your
> > > > search/dasl, you could say something like:
> > > >
> > > > <d:contains locale="abstract"> your query </d:contains>
> > > >
> > > > As 'locale' already indicates, it is actually implemented for
> > > > different languages within one xml file, so you would misuse it a
> > > > little.
> > > >
> > > > OTOH, you might just keep working with your current approach
> > > > without real problems.
> > > > Make sure that, for the abstract, you configure the
> > > > property in dasl-indexer.xml to be of type="text" (and use
> > > > property-contains in your dasl instead of propcontains, see [2]).
> > > > For date and title you might want to choose not to do this.
> > > >
> > > > -Ard
> > > >
> > > > [1]
> > > > http://www.hippocms.org/display/CMS/Hippo+Repository+ConfigurableXMLContentExtractor
> > > > [2] http://www.hippocms.org/display/CMS/06.+Using+DASL+Queries
> > > >
> > > > > Second question.
> > > > > Is it possible to index (for searching) something without
> > > > > storing its content? Just like in lucene:
> > > > > Field.Index = true
> > > > > Field.Store = false
> > > > >
> > > > > Regards,
> > > > > Darek
> > > > > ********************************************
> > > > > Hippocms-dev: Hippo CMS development public mailinglist
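
For the two-search approach (option 2 in the reply at the top of this
thread), the second dasl can be generated from the results of the
first search rather than written by hand. Below is a minimal sketch in
Python that only does the string building. The helper name
build_references_where is hypothetical and not part of Hippo or Slide;
the property name "references" is taken from the example configuration
in this thread; the second path in the usage example is made up. It is
also an assumption that a single condition can be sent without the
<d:or> wrapper. Whether the literals must be lowercased to match the
LowercaseCommaSeparatedAnalyzer is not covered here and would need
testing against a real repository.

```python
from xml.sax.saxutils import escape


def build_references_where(binary_links, prop="references"):
    """Build a DASL <d:where> clause matching docs that link to any binary.

    Hypothetical helper, not part of Hippo or Slide. `binary_links` are the
    repository locations (hrefs) returned by the first, pdf-only search.
    """
    # Drop duplicate links while keeping the order of the first search.
    unique_links = list(dict.fromkeys(binary_links))
    if not unique_links:
        raise ValueError("need at least one binary link")

    # One <d:propcontains> per link, with XML special characters escaped.
    clauses = [
        "<d:propcontains>"
        f"<d:prop>{escape(prop)}</d:prop>"
        f"<d:literal>{escape(link)}</d:literal>"
        "</d:propcontains>"
        for link in unique_links
    ]

    # Assumption: a single condition is sent without the <d:or> wrapper.
    if len(clauses) == 1:
        return f"<d:where>{clauses[0]}</d:where>"
    return "<d:where><d:or>" + "".join(clauses) + "</d:or></d:where>"


if __name__ == "__main__":
    # The second path is an invented example value.
    print(build_references_where([
        "/binaries/en/acceptance-tdd.pdf",
        "/binaries/en/other.pdf",
    ]))
```

The resulting fragment would still have to be placed inside a complete
dasl request targeting /files/project.preview or live, as described in
the reply.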
