Thanks again for your answer. Yes - it helps me a lot. Right now I was just playing with something like solution 2, but I must say I like the first one. I have to give it a try, and we'll see what comes up :)
Regards,
Darek

2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
>
> Hello,
>
> > Unfortunately the problem is more complicated. As I said before,
> > we have a document (cms type):
> > - title (simple text input)
> > - newspaper (simple text input)
> > - <url to pdf in /binaries>
> > So the stored xml in the repository looks like this (type='media'):
> > <?xml version="1.0" encoding="UTF-8"?>
> > <root>
> >   <title>aa</title>
> >   <newspaper>bb</newspaper>
> >   <media>/binaries/en/acceptance-tdd.pdf</media>
> >   <date>2007-12-06T00:00:00.000Z</date>
> > </root>
> >
> > Now, in the portal I would like to be able to sort and search
> > media types by title, newspaper and date (we can do that by
> > creating properties from these fields). The result would be the
> > list of found media types.
> >
> > But I would also like to search in media/media. So when someone
> > types 'tdd' and hits the button to search in pdf, I can give him
> > the list of media types whose pdf contains the phrase 'tdd'. The
> > result has to be the same: a list of found media types. In this case:
> > title | newspaper | getPdf
> > aa    | bb        | link(getPdf)
> >
> > So the problem is not how to index the pdf file but how to
> > connect its content with the media type.
>
> Yes, I understand your problem. IMO, you have two options: for one you
> would have to build something, for the other you have to do some
> frontend and indexing tricks:
>
> 1) You can create your own XMLContentExtractor, and when it encounters a
> link to a binary, import that binary and index it along with the
> document content. This will be feasible, not really hard, but also not
> trivial. OTOH, you will end up with a situation where, when the pdf gets
> deleted/changed, the indexed document (which indexed the pdf as well)
> won't be aware of it. You might again add some logic that checks the
> repository for all docs having a link to some pdf, so that when a pdf
> is changed, all docs using that pdf are re-indexed.
> But you need some engineering for it.
>
> 2) This is the easiest way but requires more frontend work (and has a
> big disadvantage: it needs two searches). What you need to do is:
>
> - extract all binary links with a
> nl.hippo.slide.extractor.MultiValueXMLPropertyExtractor (this extracts
> all links, comma separated, into one property)
>
> - add this property to the dasl-indexer.xml, and configure it to be
> type="text" with analyzer LowercaseCommaSeparatedAnalyzer, for example:
>
> <property
>   analyzer="nl.hippo.slide.index.analysis.LowercaseCommaSeparatedAnalyzer"
>   name="references" namespace="http://hippo.nl/cms/1.0" type="text"/>
>
> - if you now do a search in pdf, you get some results (repository
> locations in the href attr)
>
> - with these results, you construct a new dasl that says: give me all
> documents having one or more links to one of these results. The dasl
> that does that is (target is just /files/project.preview / live):
>
> <d:where>
>   <d:or>
>     <d:propcontains>
>       <d:prop>references</d:prop>
>       <d:literal>firstfound-binary-link</d:literal>
>     </d:propcontains>
>     <d:propcontains>
>       <d:prop>references</d:prop>
>       <d:literal>secondfound-binary-link</d:literal>
>     </d:propcontains>
>     <d:propcontains>
>       <d:prop>references</d:prop>
>       <d:literal>thirdfound-binary-link</d:literal>
>     </d:propcontains>
>     etc etc
>   </d:or>
> </d:where>
>
> Obviously, solution 1 is nicer in the long run (and if you are capable
> of building it (depends on how much time you have) you certainly may
> send a patch :-)) but it is also much harder!
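The last step of option 2 (turning the hrefs found by the pdf search into a second DASL query) could be sketched roughly as below. This is only an illustration in Python, not code from the thread or from Hippo: the function name `build_references_where` is made up, the `references` property name is taken from the example configuration above, and whether the literal has to be lowercased to match the LowercaseCommaSeparatedAnalyzer, or whether `<d:or>` accepts a single operand, is not stated in the thread and is left untested here.

```python
from xml.sax.saxutils import escape


def build_references_where(binary_links):
    """Build a <d:where> clause that matches documents whose 'references'
    property contains at least one of the given binary links.

    Hypothetical helper: it only assembles the XML text shown in the
    thread and does not talk to the repository.
    """
    if not binary_links:
        raise ValueError("need at least one binary link")

    clauses = []
    for link in binary_links:
        # escape() protects characters such as '&' and '<' inside the literal
        clauses.append(
            "    <d:propcontains>\n"
            "      <d:prop>references</d:prop>\n"
            f"      <d:literal>{escape(link)}</d:literal>\n"
            "    </d:propcontains>"
        )

    body = "\n".join(clauses)
    return f"<d:where>\n  <d:or>\n{body}\n  </d:or>\n</d:where>"


if __name__ == "__main__":
    # The first path is the one from the example document in the thread;
    # the second is invented for illustration.
    print(build_references_where([
        "/binaries/en/acceptance-tdd.pdf",
        "/binaries/en/another.pdf",
    ]))
```

The resulting fragment would still have to be embedded in a complete DASL search request with the proper namespace declarations and scope, which are not shown in the thread.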
>
> Hope this helps,
>
> Regards Ard
>
> ps
>
> 3) Option three is, by the way, your previous suggestion of putting the
> string contents of the pdf as a property on the document, but as you
> indicated, it is quite some code duplication, and I would discourage
> you from adding such 'large' properties on documents (depending on
> your db settings; I am not sure whether there are limits)
>
> > Again thank you for answers,
> > Darek
> >
> > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > >
> > > Hello,
> > >
> > > > Thanks for the fast reply and clarification. I also would like
> > > > to ask about indexing without storing.
> > > > We have a document like this:
> > > > - title
> > > > - newspaper
> > > > - <url to pdf in /binaries>
> > > > And now we need the ability to search documents that have some
> > > > text in the pdf. I want to do this by writing my own extractor
> > > > that will take the pdf, extract the text and put it in a
> > > > property. As you said before, this is the way it should be done.
> > >
> > > No, you really shouldn't do this. I think you are confused about
> > > the indexing:
> > >
> > > You do not need to put everything in a property to have it indexed!!
> > > Normally, I put things in an index I want to specifically search on
> > > (like, I have a <title> field in my xml, but want to be able to
> > > search on title only, and not on the entire xml. Then I extract
> > > title as a property.)
> > >
> > > But not everything configured in extractors will actually be used
> > > to set a property on a document. I do agree with you that it is a
> > > little confusing:
> > >
> > > 1) Extractors with an instruction are used to extract a property,
> > > which is set on the document, and indexed according to the
> > > configuration in dasl-indexer.xml for this property
> > > 2) Extractors without an instruction do not put a property on a
> > > document, but are only used during indexing!
> > >
> > > So, for example, if I configure:
> > >
> > > <!-- XML content extractor -->
> > > <extractor classname="nl.hippo.slide.extractor.XMLContentExtractor"
> > >   uri="/files" content-type="text/xml"/>
> > >
> > > it means that *all* xml content of xml docs under /files is indexed
> > > (according to the global indexer in dasl-indexer.xml, default
> > > nl.hippo.slide.index.analysis.SimpleStandardAnalyzer), without a
> > > property being extracted.
> > >
> > > Now, for your pdf / word etc. all you need to add is something like:
> > >
> > > <extractor classname="org.apache.slide.extractor.MSWordExtractor"
> > >   uri="/files/project.preview/binaries"
> > >   content-type="application/msword"/>
> > >
> > > or
> > >
> > > <extractor classname="org.apache.slide.extractor.PDFExtractor"
> > >   uri="/files/project.preview/binaries"
> > >   content-type="application/pdf"/>
> > >
> > > where project is your realm/workspace, see [1].
> > >
> > > If you add these extractors, stop the repository, delete the lucene
> > > index, and restart, the lucene index will be recreated. Now, when
> > > searching/doing a dasl with
> > >
> > > <d:contains>foo</d:contains>
> > >
> > > you will get hits for all xml documents, but also pdf documents
> > > containing 'foo'.
> > >
> > > Hope things are a little more clear,
> > >
> > > Regards Ard
> > >
> > > [1] http://www.hippocms.org/display/CMS/4.+Hippo+Repository+Configure+Extractors#4.HippoRepositoryConfigureExtractors-org.apache.slide.extractor.MSWordExtractor
> > >
> > > > But in this case it's completely unnecessary to keep the pdf's text.
> > > > Is there a way to avoid duplication?
> > > >
> > > > Darek
> > > >
> > > > 2007/12/20, Ard Schrijvers <[EMAIL PROTECTED]>:
> > > > >
> > > > > Hello Darek,
> > > > >
> > > > > > Hello,
> > > > > > I was looking for this information in the docs and lists and
> > > > > > found nothing.
> > > > > > If I repeated a problem - then sorry :)
> > > > > >
> > > > > > We have a problem with searching over documents. Let's say we
> > > > > > have a document that consists of: title, date, abstract.
> > > > > > We need the ability to search over these fields separately.
> > > > > > We did that by making extractors that rewrite these fields to
> > > > > > properties p_title, p_date, p_abstract. Now lucene can index
> > > > > > it and it works.
> > > > > > But ...
> > > > > > Now we have the same content in 2 places.
> > > > > > Is there a better way to do this?
> > > > >
> > > > > In principle, this is the way to do it. For a title and a date,
> > > > > it is pretty normal and straightforward. For the abstract you
> > > > > might not want to duplicate the entire text. For the abstract
> > > > > you might also work with ConfigurableXMLContentExtractor [1].
> > > > > Then in your search/dasl, you could say something like:
> > > > >
> > > > > <d:contains locale="abstract"> your query </>
> > > > >
> > > > > As 'locale' already indicates, it is actually implemented for
> > > > > different languages within one xml file, so you would misuse it
> > > > > a little.
> > > > >
> > > > > OTOH, you might just keep working with your current approach
> > > > > without real problems. Make sure that for the abstract, you
> > > > > configure the property in dasl-indexer.xml to be of type="text"
> > > > > (and use property-contains in your dasl instead of propcontains,
> > > > > see [2]). For date and title you might want to choose to not do
> > > > > this
> > > > >
> > > > > -Ard
> > > > >
> > > > > [1] http://www.hippocms.org/display/CMS/Hippo+Repository+ConfigurableXMLContentExtractor
> > > > > [2] http://www.hippocms.org/display/CMS/06.+Using+DASL+Queries
> > > > >
> > > > > > Second question.
> > > > > > Is it possible to index (for searching) something without
> > > > > > storing its content? Just like in lucene:
> > > > > > Field.Index = true
> > > > > > Field.Store = false
> > > > > >
> > > > > > Regards,
> > > > > > Darek
> > > > > > ********************************************
> > > > > > Hippocms-dev: Hippo CMS development public mailinglist
