[MarkLogic Dev General] RE: RE: Creating Collections

Kelly Stirman Sun, 22 Nov 2009 08:47:20 -0800

David,

I'm glad I asked why you're searching two documents. :)


If you have 20,000 top level elements in a large document, and you also wish to 
return those elements as results, then to me it seems like you probably want to 
load them as individual documents. (To me your "document" sounds more like an 
archive than a document.) There are several reasons for this.

1) the performance of queries will be better
2) relevance is determined at the document level, so all the matching nodes in 
your large document will have the same score
3) you may be able to use the Search API to do most of what you're trying to do

If you follow this approach, you could group all these documents in a common 
collection or directory. Or, perhaps they all have a common root node. Keep in 
mind if you do:

cts:search(/my-root-node,$query)

You will only search documents with that root node. No need to specify a 
specific collection or document name. You can also do:

cts:search(collection("my-collection")/my-root-node,$query)

or

cts:search(xdmp:directory("/my-directory/","infinity")/my-root-node,$query)

In both of these approaches, the collection and directory constraints will be 
combined with the root node constraint.

As Geert says, you can use cts:collection-query() and cts:directory-query() 
instead, if you want to group all of your constraints in a cts:query, which I 
think is generally a best practice. However, it is a little easier to explain 
the XPath part of the constraints with the syntax above. :)
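
For comparison, here is a rough sketch of the cts:query form, with the collection constraint moved out of the XPath and into the query itself. The collection name, root node name, and search word are placeholders, not anything from your actual data:

```xquery
xquery version "1.0-ml";

(: Placeholder names: "my-collection", my-root-node, and the word "Codeine"
   are for illustration only. :)
let $query := cts:and-query((
  (: restrict the search to documents in one collection :)
  cts:collection-query("my-collection"),
  (: the actual search term :)
  cts:word-query("Codeine")
))
return cts:search(/my-root-node, $query)
```

To constrain by directory instead, swapping the first clause for cts:directory-query("/my-directory/", "infinity") should give the same effect as the xdmp:directory() form above.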

To load these individual nodes as documents, you should take a look at 
RecordLoader - that's probably the easiest and most efficient way.

Now, there is a concept in MarkLogic called fragmentation which allows you to 
store very large documents, and to perform minimal disk IO when retrieving or 
updating the individual fragments. This is a very useful feature. However, for 
search applications, the best practice is to load the individual nodes as 
documents. If there is metadata that applies to all your individual nodes, then 
we can talk about how you might deal with that.

I think one lesson here is that it is a good idea to step back and talk about 
the bigger picture - what are you trying to do in your application? This will 
help us to recommend the best approach. Right now it still seems mysterious. :)

Kelly

Thanks for the suggestion.
Collections are probably what I want to use, but I'm just experimenting to try 
to learn.
In my case I have 2 documents (both large, about 500MB each, with about 20,000 
top level elements each) where I want to do a single search and produce a 
single set of results ordered by score.
This will lead to a page which shows "Top 10 results" where if you click on 
the link it will go to the detail for that record. These documents are from 
the same category of data, but have totally different schemas. They happen 
right now to reside in a directory with 3 other documents I do NOT want to 
search. So my choices are:

1) Move these 2 into their own private directory
2) Create a collection with these 2 docs (learning how to do that now, thanks 
to the group !)
3) Search the 2 docs separately and try to combine the results (very ugly, all 
sorts of extra work needed to combine the results)
4) Search the 2 documents in ONE search. Thanks to the suggestion of passing 
multiple documents to fn:doc(), I found this search works *exactly how I 
want*

The trick here is that I want a particular sub-record returned as the result, 
not the entire document, hence the funky XPath expression after doc() to 
pick the 2 kinds of sub-record sets I want searched.


cts:search(
   fn:doc(("/Mesh/desc2010.xml","/Mesh/supp2010.xml"))/
     (DescriptorRecordSet/DescriptorRecord|SupplementalRecordSet/SupplementalRecord),
   "Codeine")


This Works !!!!!


And it beats the hell out of my option 3 solution, which is
        cts:search( doc("doc1") ) | cts:search(doc("doc2")) ...

because it seems like the results are now combined, which means
cts:remainder() should work on the combined set and I don't have to re-order the 
results!
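
A minimal sketch of how the "Top 10 results" page might be built from that single combined search. The wrapper element <result> and its attributes are made up for illustration; the paths and search term are the ones from the query above:

```xquery
xquery version "1.0-ml";

(: Take the first 10 hits of the combined search; cts:search returns
   results in relevance order by default. :)
let $hits := cts:search(
  fn:doc(("/Mesh/desc2010.xml", "/Mesh/supp2010.xml"))/
    (DescriptorRecordSet/DescriptorRecord | SupplementalRecordSet/SupplementalRecord),
  "Codeine")[1 to 10]
for $hit in $hits
(: <result> is a hypothetical wrapper, not part of either schema :)
return <result score="{cts:score($hit)}" uri="{xdmp:node-uri($hit)}"/>
```

Note that because relevance is computed per document (or fragment), records that come from the same large document may well report the same score.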

I will definitely look into making a collection and/or separating these
2 docs into their own directory. But good to know I can do it this way too.
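
For the collection route, something along these lines should work, assuming the two documents are already loaded at the URIs used above. The collection name "mesh-search" is a made-up placeholder:

```xquery
xquery version "1.0-ml";

(: "mesh-search" is a hypothetical collection name chosen for this sketch.
   Adds both existing documents to it without reloading them. :)
for $uri in ("/Mesh/desc2010.xml", "/Mesh/supp2010.xml")
return xdmp:document-add-collections($uri, "mesh-search")
```

Once that update has committed, a separate query could replace the fn:doc((...)) step with fn:collection("mesh-search") and leave the rest of the search unchanged.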
