Thanks.  These are really good points.  I had never even considered
overlapping fragments! (Yuck!)
Having a URI for each chunk is indeed good.  This is a good discussion.

Here are some things, though, that may lead one to consider the
reverse case ...

1) A document takes at minimum 1 fragment.  It's documented (no pun
intended) that fragments should ideally be in the 10k-100k range.
  Some of my big documents are actually 100k+ 'mini documents' which
are maybe 100-500 bytes each.
  I have experienced, and also been advised, that fragmentation that
small can lead to performance problems. 
  --> Maybe a good rule of thumb is: don't split up docs if the
mini-docs are "tiny" (where "tiny" is some number like < 1k?)
  But then what do you do if you have 100MB of 100-byte elements?
You're up the creek either way.
  Split documents to have 100 elements each?  That sorta scares me.
Should it?
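  (For concreteness, here's roughly what I picture that split looking
like - a sketch only, all the names like "/big.xml" and <record> are
made up:)

    let $records := fn:doc("/big.xml")/record-set/record
    let $size := 100   (: elements per chunk document :)
    for $i in (1 to xs:integer(fn:ceiling(fn:count($records) div $size)))
    let $chunk := fn:subsequence($records, ($i - 1) * $size + 1, $size)
    return xdmp:document-insert(
       fn:concat("/chunks/chunk-", $i, ".xml"),
       <record-set>{ $chunk }</record-set>)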



2) I have to name the documents.  I certainly can create unique IDs,
and sometimes elements have obvious IDs. 
   But it's not always obvious what to name them.  Sometimes choosing
names can lead to bad problems, like when you thought IDs were unique
but they were not,
   or you just choose "1.xml", "2.xml", etc. ...
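   (One naming scheme I've been toying with - a sketch only, with
hypothetical names, and I'd have to verify xdmp:md5 is available on
our version - is to use the element's own @id when it has one, and
fall back to a content hash otherwise:)

    declare function local:uri-for($rec as element()) as xs:string
    {
      if ($rec/@id)
      then fn:concat("/records/", $rec/@id, ".xml")
      else fn:concat("/records/", xdmp:md5(xdmp:quote($rec)), ".xml")
    };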

3) Overhead.  I think there is a fundamental minimum overhead beyond
the fragment level for each doc, much like in a normal filesystem:
  it's generally 'bad practice' (<shudder> I tend to get the willies
when people say "best practice" or "bad practice" ... but oh well :) 
  to have lots of tiny files.  The system has to maintain metadata
for each document that it does not for each fragment.
  How bad off is this?  I don't have a clue!

4) Bulk updates/deltas/deletes.
  If a document is split into mini-docs (say 100k mini-docs) and I
get a single bulk update from the source, 
  it's much added complexity to try to figure out the
add/update/delete logic.
  I suppose one could just delete the whole directory and start from
scratch ... 
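  (At least the nuke-and-reload version is trivially simple -
assuming all the mini-docs live under one directory; the URI here is
made up:)

    (: wipe the old snapshot first :)
    xdmp:directory-delete("/mesh/records/")
    (: ... then re-run the load against the new source
       in a later transaction :)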



A couple more advantages you didn't mention are ...

5) Collections.  You can assign documents to collections, but not
fragments to collections. 
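   (E.g. - a sketch with made-up names - tag each chunk document as
it's loaded, then scope searches to the collection:)

    xdmp:document-set-collections("/chunks/chunk-1.xml", "mesh-2010")

    (: later, in a separate query: :)
    cts:search(fn:collection("mesh-2010"), "Codeine")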

6) Configuration privileges.  Sometime - possibly soon now, if I'm
lucky (or unlucky) - we'll have an ML server owned by the IT or DBA
dept.
   At that point I fully expect that it will be "locked down" much
like our Oracle DB is, and we poor users will only be granted
   the bare minimum rights.  At that point I may not have the luxury
of specifying fragment roots or fragment parents without going through
the grueling 
   process of IT bureaucracy.  But I almost certainly will have the
rights to create documents at will.
   That may in fact be the *biggest* advantage overall.





-----Original Message-----
From: [email protected]
[mailto:[email protected]] On Behalf Of Michael
Sokolov
Sent: Sunday, November 22, 2009 1:59 PM
To: 'General Mark Logic Developer Discussion'
Subject: RE: [MarkLogic Dev General] RE: RE: Creating Collections 

Probably someone from MarkLogic will give you a more thorough answer,
but our experience with sub-document fragments has ultimately led us
to chunk into separate documents in most cases, rather than relying
on the fragmentation configuration.  Here are some reasons:

1) overlapping fragments lead to trouble.  It has often happened that we
*thought* we were fragmenting documents into non-overlapping chunks, but
in
fact the rules we set up led to nested fragments, and this can
negatively
impact performance.  If you explicitly chunk your content into
documents, it
forces these issues out into the open.

2) Documents have URIs.  These are supported by various useful
features in the language (like the doc() function) and in MarkLogic's
extensions (see, e.g., the uri lexicon).  Sub-document fragments don't
have any first-class existence.  This also means that documents can be
accessed using directories, and this makes them accessible to standard
browse interfaces like WebDAV.  No similar tools exist for fragments.

For example, if you have a random node and you want to know what
document it's a part of, you can call base-uri($node).  There's no
analogous capability for some arbitrary fragment: you'd have to use
special knowledge of your own data structure to do something like
that (e.g. $node/ancestor-or-self::my-fragment-element[1]/@id).
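For instance (a sketch only - the <record> element here is made up),
you can get from a search hit back to its containing document's URI
in one step:

    let $hit := cts:search(fn:collection()//record, "Codeine")[1]
    return fn:base-uri($hit)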

3) There is the relevance consideration Kelly raised, which could be
critical if you are relying on relevance ranking of search results.

4) I think there may be some locking/transactional isolation behaviors
that
are different between fragments and documents, but I'm not sure: perhaps
they actually behave the same w.r.t. updates.

-Mike

> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of 
> Lee, David
> Sent: Sunday, November 22, 2009 11:56 AM
> To: General Mark Logic Developer Discussion
> Subject: RE: [MarkLogic Dev General] RE: RE: Creating Collections 
> 
> Good suggestion about separate documents.    In fact these particular
> documents are just lists of identical smaller things, just 
> like you surmise.
> 
> You say  
> 
> "Now, there is a concept in MarkLogic called fragmentation 
> which allows you to store very large documents, and to 
> perform minimal disk IO when retrieving or updating the 
> individual fragments. This is a very useful feature. However, 
> for search applications, the best practice is to load the 
> individual nodes as documents. If there is metadata that 
> applies to all your individual nodes, then we can talk about 
> how you might deal
> with that."    
> 
> 
> Is this really fundamentally true?  I hear conflicting statements.
> How have you determined this "best practice"?
> 
> I've been using Fragment Parents so that this "big document" 
> is fragmented into individual fragments, without having to 
> create separate "documents".
> I have no need to apply meta-data to these mini-docs at all. 
> 
> Is it really fundamentally true that, given the same data set, 
> splitting them into documents instead of fragments 
> improves performance?
> The performance I'm getting is phenomenal, and I have read or seen 
> implied in many places in the ML documentation that 
> fragmenting documents is a great way of doing things.
> 
> Besides meta-data associated with each mini-doc, do I really 
> truly gain an advantage by splitting the big doc into littler docs?
> That seems contrary to what I'm reading in the ML 
> documentation.     One
> of the huge advantages I see with simply storing this 
> mega-document (in MarkLogic as apposed to my old way of 
> thinking 'file based' XML)  is that it seems to work 
> perfectly as-is, and it seems to me an unnecessary complexity 
> to split it up unless there are hard gains to it.  
> 
> I can certainly do some tests, but I'd love someone who knows 
> the authoritative answer, or even hard anecdotal evidence, to comment.
> 
> 
> -David
> 
> 
> 
> 
> 
> 
> 
> 
> 
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of 
> Kelly Stirman
> Sent: Sunday, November 22, 2009 11:47 AM
> To: [email protected]
> Subject: [MarkLogic Dev General] RE: RE: Creating Collections 
> 
> David,
> 
> I'm glad I asked why you're searching two documents. :)
> 
> If you have 20,000 top level elements in a large document, 
> and you also wish to return those elements as results, then 
> to me it seems like you probably want to load them as 
> individual documents. (To me your "document" sounds more like 
> an archive than a document.) There are several reasons for this.
> 
> 1) the performance of queries will be better
> 2) relevance is determined at the document level, so all the 
> matching nodes in your large document will have the same score
> 3) you may be able to use the searchAPI to do most of what 
> you're trying to do
> 
> If you follow this approach, you could group all these 
> documents in a common collection or directory. Or, perhaps 
> they all have a common root node. Keep in mind if you do:
> 
> cts:search(/my-root-node,$query)
> 
> You will only search documents with that root node. No need 
> to specify a specific collection or document name. You can also do:
> 
> cts:search(collection("my-collection")/my-root-node,$query)
> 
> or
> 
> cts:search(
>   xdmp:directory("/my-directory/","infinity")/my-root-node, $query)
> 
> In both of these approaches, the collection and directory 
> constraints will be combined with the root node constraint.
> 
> As Geert says, you can use cts:collection-query() and
> cts:directory-query() instead, if you want to group all of 
> your constraints in a cts:query, which I think is generally a 
> best practice.
> However, it is a little easier to explain the XPath part of 
> the constraints with the syntax above. :)
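> 
> (A sketch of that combined form, using the same hypothetical names
> as above:)
> 
> cts:search(/my-root-node,
>   cts:and-query((
>     cts:collection-query("my-collection"),
>     cts:directory-query("/my-directory/", "infinity"),
>     cts:word-query("Codeine"))))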
> 
> To load these individual nodes as documents, you should take 
> a look at RecordLoader - that's probably the easiest and most 
> efficient way.
> 
> Now, there is a concept in MarkLogic called fragmentation 
> which allows you to store very large documents, and to 
> perform minimal disk IO when retrieving or updating the 
> individual fragments. This is a very useful feature. However, 
> for search applications, the best practice is to load the 
> individual nodes as documents. If there is metadata that 
> applies to all your individual nodes, then we can talk about 
> how you might deal with that.
> 
> I think one lesson here is that it is a good idea to step 
> back and talk about the bigger picture - what are you trying 
> to do in your application? This will help us to recommend the 
> best approach. Right now it still seems mysterious. :)
> 
> Kelly
> 
> Thanks for the suggestion.
> Collections are probably what I want to use, but I'm just 
> experimenting to try to learn.
> In my case I have 2 documents (both large, about 500MB each, 
> with about 20,000 top level elements each) where I want to do 
> a single search and produce a single set of results ordered by score.
> This will lead to a page which shows "Top 10 results"  where 
> if you click on the link it will go to the detail for that 
> record.  These documents are from the same category of data, 
> but are totally different
> schemas.   They happen right now to reside in a directory with 3 other
> documents I do NOT want to search.  So my choices are 
> 
> 1) Move these 2 into their own private directory
> 2) Create a collection with these 2 docs (learning how to do 
> that now, thanks to the group !)
> 3) Search the 2 docs separately and try to combine the 
> results (very ugly , all sorts of extra work needed to 
> combine the results)
> 4) Search the 2 documents in ONE search.   Thanks to the suggestion of
> using multiple documents to fn:doc() I found this search 
> works *exactly how I want*
> 
> The trick here is that I want a particular sub-record 
> returned as the result, not the entire document, hence the 
> funky xpath expression after doc() to pick the 2 kinds of sub 
> record sets I want searched.
> 
> 
> cts:search( 
>    fn:doc(("/Mesh/desc2010.xml","/Mesh/supp2010.xml"))/
>      (DescriptorRecordSet/DescriptorRecord |
>       SupplementalRecordSet/SupplementalRecord), 
>    "Codeine")
> 
> 
> This Works !!!!!
> 
> 
> And it beats the hell out of my #3 solution, which is
>       cts:search( doc("doc1") ) | cts:search(doc("doc2")) ...
> 
> because it seems like now the results are combined, which means
> remainder() should work on the combined set and I don't have 
> to re-order the results!
> 
> I will definitely look into making a collection and/or 
> separating these
> 2 docs into their own directory.  But good to know I can do 
> this way too.

