RE: [MarkLogic Dev General] RE: RE: Creating Collections

Lee, David Sun, 22 Nov 2009 08:56:16 -0800

Good suggestion about separate documents.    In fact these particular
documents are just lists of identical smaller things, just like you
surmise.


You say  

"Now, there is a concept in MarkLogic called fragmentation which allows
you to store very large documents, and to perform minimal disk IO when
retrieving or updating the individual fragments. This is a very useful
feature. However, for search applications, the best practice is to load
the individual nodes as documents. If there is metadata that applies to
all your individual nodes, then we can talk about how you might deal
with that."    


Is this really fundamentally true?  I hear conflicting statements.
How did you determine this "best practice"?

I've been using Fragment Parents so that this "big document" is
fragmented into individual fragments, without having to create separate
"documents".
I have no need to apply meta-data to these mini-docs at all. 

Is it really fundamentally true that, given the same data set,
splitting it into documents instead of fragments improves
performance?
The performance I'm getting is phenomenal, and the ML documentation
implies in many places that fragmenting documents is a great way of
doing things.

Besides meta-data associated with each mini-doc, do I really gain
an advantage by splitting the big doc into smaller docs?
That seems contrary to what I'm reading in the ML documentation. One
of the huge advantages I see in simply storing this mega-document (in
MarkLogic, as opposed to my old 'file based' XML way of thinking) is
that it seems to work perfectly as-is, and splitting it up seems to me
an unnecessary complexity unless there are hard gains from it.

I can certainly do some tests, but I'd love someone who knows the
authoritative answer, or even has hard anecdotal evidence, to comment.


-David









-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Kelly
Stirman
Sent: Sunday, November 22, 2009 11:47 AM
To: [email protected]
Subject: [MarkLogic Dev General] RE: RE: Creating Collections 

David,

I'm glad I asked why you're searching two documents. :)

If you have 20,000 top level elements in a large document, and you also
wish to return those elements as results, then to me it seems like you
probably want to load them as individual documents. (To me your
"document" sounds more like an archive than a document.) There are
several reasons for this.

1) the performance of queries will be better
2) relevance is determined at the document level, so all the matching
nodes in your large document will have the same score
3) you may be able to use the searchAPI to do most of what you're trying
to do

If you follow this approach, you could group all these documents in a
common collection or directory. Or, perhaps they all have a common root
node. Keep in mind if you do:

cts:search(/my-root-node,$query)

You will only search documents with that root node. No need to specify a
specific collection or document name. You can also do:

cts:search(collection("my-collection")/my-root-node,$query)

or

cts:search(xdmp:directory("/my-directory/","infinity")/my-root-node,$query)

In both of these approaches, the collection and directory constraints
will be combined with the root node constraint.

As Geert says, you can use cts:collection-query() and
cts:directory-query() instead, if you want to group all of your
constraints in a cts:query, which I think is generally a best practice.
However, it is a little easier to explain the XPath part of the
constraints with the syntax above. :)
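
A minimal sketch of that cts:query form, reusing the placeholder
collection, directory, and root node names from the examples above.
The word query on "Codeine" is only a stand-in for whatever $query the
application builds:

```xquery
xquery version "1.0-ml";

(: Sketch only: $query is a stand-in for the application's real query. :)
let $query := cts:word-query("Codeine")
return
  cts:search(
    /my-root-node,
    (: All constraints live in one cts:query instead of in the XPath. :)
    cts:and-query((
      cts:collection-query("my-collection"),
      cts:directory-query("/my-directory/", "infinity"),
      $query
    ))
  )
```

In practice you would likely use only one of the collection or
directory constraints; both are shown to mirror the two XPath forms.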

To load these individual nodes as documents, you should take a look at
RecordLoader - that's probably the easiest and most efficient way.

Now, there is a concept in MarkLogic called fragmentation which allows
you to store very large documents, and to perform minimal disk IO when
retrieving or updating the individual fragments. This is a very useful
feature. However, for search applications, the best practice is to load
the individual nodes as documents. If there is metadata that applies to
all your individual nodes, then we can talk about how you might deal
with that.

I think one lesson here is that it is a good idea to step back and talk
about the bigger picture - what are you trying to do in your
application? This will help us to recommend the best approach. Right now
it still seems mysterious. :)

Kelly

Thanks for the suggestion.
Collections are probably what I want to use, but I'm just experimenting
to try to learn.
In my case I have 2 documents (both large, about 500MB each, with about
20,000 top level elements each) where I want to do a single search and
produce a single set of results ordered by score.
This will lead to a page which shows "Top 10 results"  where if you
click on the link it will go to the detail for that record.  These
documents are from the same category of data, but are totally different
schemas.   They happen right now to reside in a directory with 3 other
documents I do NOT want to search.  So my choices are 

1) Move these 2 into their own private directory
2) Create a collection with these 2 docs (learning how to do that now,
thanks to the group !)
3) Search the 2 docs separately and try to combine the results (very
ugly, all sorts of extra work needed to combine the results)
4) Search the 2 documents in ONE search.   Thanks to the suggestion of
using multiple documents to fn:doc() I found this search works *exactly
how I want*
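
For option 2, a rough sketch of putting both documents into one
collection. It assumes the two URIs used in the search in this message
already exist in the database; the collection name "mesh-search" is a
made-up placeholder:

```xquery
xquery version "1.0-ml";

(: Sketch only: "mesh-search" is a hypothetical collection name. :)
for $uri in ("/Mesh/desc2010.xml", "/Mesh/supp2010.xml")
return xdmp:document-add-collections($uri, "mesh-search")
```

After that, collection("mesh-search") could stand in for the fn:doc()
call in the search.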

The trick here is that I want a particular sub-record returned as the
result, not the entire document, hence the funky XPath expression
after doc() to pick the 2 kinds of sub-record sets I want searched.


cts:search(
   fn:doc(("/Mesh/desc2010.xml","/Mesh/supp2010.xml"))/
     (DescriptorRecordSet/DescriptorRecord|SupplementalRecordSet/SupplementalRecord),
   "Codeine")


This Works !!!!!


And it beats the hell out of my option 3 solution, which is
        cts:search( doc("doc1") ) | cts:search(doc("doc2")) ...

because it seems the results are now combined, which means
remainder() should work on the combined set and I don't have to re-order
the results!

I will definitely look into making a collection and/or separating these
2 docs into their own directory.  But good to know I can do this way
too.
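
For the "Top 10 results" page, one possible sketch that takes the first
ten hits of the combined search and reports each hit's score. It relies
on cts:search returning results in relevance order by default; the
<result> wrapper element is purely illustrative:

```xquery
xquery version "1.0-ml";

(: Sketch only: <result> is a hypothetical wrapper for display purposes. :)
(
  for $hit in cts:search(
      fn:doc(("/Mesh/desc2010.xml", "/Mesh/supp2010.xml"))/
        (DescriptorRecordSet/DescriptorRecord
          | SupplementalRecordSet/SupplementalRecord),
      cts:word-query("Codeine"))
  return
    (: cts:score gives the relevance score of a node returned by cts:search. :)
    <result score="{cts:score($hit)}">{fn:local-name($hit)}</result>
)[1 to 10]
```
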
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
