It's not so much the cache you need to worry about as the fragment reads off disk. To fetch the ten items starting at item #1,000,000, you really don't want to have to read the previous million fragments off disk; that's a lot of random seeks. That's where unfiltered helps you: it lets you jump ahead and read just the 10 that matter.
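A minimal sketch of that deep-paging pattern (the query and the `author` element name are illustrative, not from the original thread): an unfiltered search can be subscripted directly, the skip is resolved from the indexes, and only the ten selected fragments are read from disk.

```xquery
(: Jump straight to items 1,000,001 to 1,000,010 of an unfiltered search.
   Only these 10 fragments are fetched; the million items before them are
   skipped using the indexes alone. Names here are hypothetical. :)
let $results := cts:search(fn:doc(),
                           cts:word-query("marklogic"),
                           "unfiltered")[1000001 to 1000010]
return $results//author
```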
The range index, on the other hand, is how a site like MarkMail.org can give you statistics about your result set (the authors, etc.) without having to read the result set off disk.

-jh-

On Mar 18, 2010, at 12:16 PM, Paul M wrote:

> So unfiltered lets one go deep in paging...
> 1,000,001 to 1,000,010
> Filtered may max out the caches earlier:
> 200,001 to 200,010 maxes the next page out of cache.
>
> Memory and fragmentation are still the main factors affecting total records
> [990,000 to 1,000,010] // authors
> because if the fragments are small, KB vs MB, more can be loaded...
>
> P.S. The expanded cache is the one that will be used; this is the one that
> is filled from disk, correct?
>
> And range indexes can be used to avoid disk access altogether (for small
> bits of information).
>
> --- On Thu, 3/18/10, [email protected]
> <[email protected]> wrote:
>
> From: [email protected]
> <[email protected]>
> Subject: General Digest, Vol 69, Issue 66
> To: [email protected]
> Date: Thursday, March 18, 2010, 11:46 AM
>
> Send General mailing list submissions to
>     [email protected]
>
> To subscribe or unsubscribe via the World Wide Web, visit
>     http://xqzone.com/mailman/listinfo/general
> or, via email, send a message with subject or body 'help' to
>     [email protected]
>
> You can reach the person managing the list at
>     [email protected]
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of General digest..."
>
> Today's Topics:
>
>    1. RE: Unfiltered ok, but what of fragment loading (Kelly Stirman)
>    2. Re: "Hot Swapping" large data sets. (Jason Hunter)
>    3. Re: Unfiltered ok, but what of fragment loading (Jason Hunter)
>    4. MLSQL - JDOM version?
>       (Wyatt VanderStucken)
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Thu, 18 Mar 2010 10:33:34 -0700
> From: Kelly Stirman <[email protected]>
> Subject: [MarkLogic Dev General] RE: Unfiltered ok, but what of
>     fragment loading
> To: "[email protected]"
>     <[email protected]>
> Message-ID:
>     <[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
> If you want to get only the authors and their values, you should take a look
> at cts:element-values() or cts:element-attribute-values(). This will require
> creating a range index on the node where your authors are stored, but it will
> eliminate the need to pull all documents into memory.
>
> You can also use cts:frequency() to determine how frequently each author is
> mentioned across all 300 documents.
>
> Kelly
>
> Message: 2
> Date: Thu, 18 Mar 2010 07:07:16 -0700 (PDT)
> From: Paul M <[email protected]>
> Subject: [MarkLogic Dev General] Unfiltered ok, but what of fragment
>     loading
> To: [email protected]
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
> Say I perform an unfiltered search that resolves to 300 fragments. Since it
> was unfiltered, no fragments needed to be loaded into memory for the
> *search*; only the indexes were used. Now let's say I want the authors from
> all these fragments/docs (fragment = doc, since there is no fragmentation
> policy). Does the data still need to be loaded into memory for all 300 docs
> even if I only want a small piece? i.e. will the expanded/compressed caches
> (not certain?) need to be filled with 300 docs?
> i.e. Even if a search can be performed without pagination, this does not save
> one from blowing out the caches when the data is retrieved from the docs?
> Pagination may still be required?
>
> Any information is appreciated...
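A sketch of Kelly's lexicon suggestion, assuming an element range index exists on a hypothetical `author` element: both the values and their frequencies come straight from the in-memory range index, scoped to the same query that matched the fragments, with no document reads at all.

```xquery
(: Requires a string range index on <author> (hypothetical element name).
   cts:element-values() answers from the lexicon; cts:frequency() reports
   how many matching fragments each value occurs in. :)
for $author in cts:element-values(xs:QName("author"), (), (),
                                  cts:word-query("marklogic"))
return fn:concat($author, ": ", cts:frequency($author))
```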
>
> ------------------------------
>
> Message: 2
> Date: Thu, 18 Mar 2010 11:07:07 -0700
> From: Jason Hunter <[email protected]>
> Subject: Re: [MarkLogic Dev General] "Hot Swapping" large data sets.
> To: General Mark Logic Developer Discussion
>     <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="windows-1252"
>
> For a single batch load, I like that, but if you do repeated loads you'll
> have to create new roles for every batch to distinguish the new content
> from the old. It seems mentally cheaper/lighter to me to use collections.
> My 2c.
>
> -jh-
>
> On Mar 18, 2010, at 9:47 AM, Danny Sokolsky wrote:
>
> > The URI privilege does not control access to the document; it specifies
> > whether you can create a document in that URI space.
> >
> > You can do what Keith suggests by putting a read permission on each
> > document that is associated with a role. Then, when you are ready, grant
> > that role to a role your users already have. To do this, you would have to
> > add several permissions during the load. For example, you might add a read
> > and update permission for a "loader" role, and also add a read permission
> > for a "content-user" role. Then, after you are satisfied that your content
> > is the way you want it, you can give the "content-user" role to the users
> > of your application.
> >
> > -Danny
> >
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Keith L.
> > Breinholt
> > Sent: Thursday, March 18, 2010 9:34 AM
> > To: General Mark Logic Developer Discussion
> > Subject: RE: [MarkLogic Dev General] "Hot Swapping" large data sets.
> >
> > Another way to allow you to load and update sets and then only make them
> > visible when you are done is to load the content with a unique URI
> > privilege that is assigned to your loader/enricher program.
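Danny's load-time permission scheme can be sketched as follows (the role names, URI, and query are illustrative; the roles would have to exist already):

```xquery
(: Load with read/update for a "loader" role plus read for a
   "content-user" role. Users see the document only once the
   "content-user" role is granted to a role they hold. :)
xdmp:document-load("/data/batch-0001/doc.xml",
  <options xmlns="xdmp:document-load">
    <permissions>{
      xdmp:permission("loader", "read"),
      xdmp:permission("loader", "update"),
      xdmp:permission("content-user", "read")
    }</permissions>
  </options>)
```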
> >
> > Then when you are done and the content is ready, you can add that privilege
> > to the role of any users/applications that need to see it. That way only
> > completed content is visible, and it appears ‘instantaneously’ when the
> > privilege is added to the role.
> >
> > Keith L. Breinholt
> > [email protected]
> >
> > From: [email protected]
> > [mailto:[email protected]] On Behalf Of Jason Hunter
> > Sent: Thursday, March 18, 2010 12:10 AM
> > To: General Mark Logic Developer Discussion
> > Subject: Re: [MarkLogic Dev General] "Hot Swapping" large data sets.
> >
> > On Mar 17, 2010, at 5:23 AM, Lee, David wrote:
> >
> > I need to update some largish (1G+) sets of documents fairly atomically.
> > That is, I'd like to update all the documents and perform some operations
> > like adding properties etc., then all at once make the updates visible.
> > The update process could take several hours. Currently this document set
> > shares the same forest as other document sets. It's not possible to split
> > these up because the app needs to cross-query across all the document
> > sets.
> >
> > Any suggestions on how to accomplish this?
> >
> > What happens if you try loading everything as part of a single XCC call,
> > passing the large array of files?
> >
> > If you want to follow Wayne's advice on using collections, I suppose you'd
> > want to put each batch of docs in a uniquely named collection. Then you
> > can run your queries against fn:collection($seq) where $seq is the
> > sequence of collections that have been loaded so far. Or, perhaps more
> > simply, you can do a cts:not-query() against cts:collection-query("latest")
> > and thus exclude the most recent batch but allow all other docs that were
> > loaded before. It keeps the new collection in the dark, basically. Handy,
> > efficient, and if each batch gets its own ID then you can easily exclude
> > any batch.
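Jason's collection trick, sketched (the collection name and query are illustrative): each batch loads into its own collection, and application queries wrap a not-query around the in-flight batch's collection so it stays invisible until the load commits.

```xquery
(: Match everything except documents in the batch still being loaded.
   "latest" is an illustrative collection name; swap in the current
   batch ID to exclude any batch. :)
cts:search(fn:doc(),
  cts:and-query((
    cts:word-query("marklogic"),
    cts:not-query(cts:collection-query("latest"))
  )),
  "unfiltered")[1 to 10]
```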
> >
> > Point-in-time queries would do something similar, and are suitable if
> > you're always doing just one bulk load at a time. Then you can use the
> > point in time to control the visibility.
> >
> > -jh-
> >
> > NOTICE: This email message is for the sole use of the intended recipient(s)
> > and may contain confidential and privileged information. Any unauthorized
> > review, use, disclosure or distribution is prohibited. If you are not the
> > intended recipient, please contact the sender by reply email and destroy
> > all copies of the original message.
> >
> > _______________________________________________
> > General mailing list
> > [email protected]
> > http://xqzone.com/mailman/listinfo/general
>
> ------------------------------
>
> Message: 3
> Date: Thu, 18 Mar 2010 11:14:05 -0700
> From: Jason Hunter <[email protected]>
> Subject: Re: [MarkLogic Dev General] Unfiltered ok, but what of
>     fragment loading
> To: General Mark Logic Developer Discussion
>     <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset="us-ascii"
>
> > i.e. Even if a search can be performed without pagination, this does not
> > save one from blowing out the caches when the data is retrieved from the
> > docs? Pagination may still be required?
>
> Others have answered how you can use range indexes to pull the data from
> documents without fetching the documents, but in answer to this specific
> question, the perk of an unfiltered search is that you can jump ahead
> arbitrarily deep -- so you can get the authors of documents 1,000,001 to
> 1,000,010, even without range indexes, using only 10 fragment reads. So you
> won't blow out any caches.
>
> -jh-
>
> ------------------------------
>
> Message: 4
> Date: Thu, 18 Mar 2010 14:46:43 -0400
> From: Wyatt VanderStucken <[email protected]>
> Subject: [MarkLogic Dev General] MLSQL - JDOM version?
> To: General Mark Logic Developer Discussion
>     <[email protected]>
> Message-ID: <[email protected]>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Greetings all (particularly -jh-),
>
> I've been experimenting with the latest MLSQL, and had a question
> regarding the jdom.jar file that is included with the MLSQL
> distribution. The MANIFEST.MF inside the .jar indicates that it is JDOM
> Implementation-Version: 1.0.1, but I don't see that version listed on
> the JDOM site (http://www.jdom.org/news/index.html); it looks like it
> was built 9/14/2005...
>
> Where it gets tricky is that I'm trying to add the MLSQL servlet to an
> existing Java webapp where JDOM is already in use
> (Implementation-Version: 1.0beta10)...
>
> When I use the 1.0beta10 version I get the following error:
> java.lang.NoSuchMethodError:
> org.jdom.Element.addContent(Lorg/jdom/Content;)Lorg/jdom/Element;
>
> The version bundled with MLSQL remedies the problem (as does JDOM
> version 1.0), but I'm concerned that deploying a newer version will
> break something. Initial tests are good, but this is a large
> application with 30+ developers, so I'm not sure of all the code that
> depends on JDOM...
>
> Can you say with any degree of certainty that code written against JDOM
> 1.0beta10 will be compatible with JDOM version 1.0 or 1.0.1? If forced
> to, will MLSQL work with JDOM version 1.0?
>
> Thanks in advance,
> Wyatt
>
> ------------------------------
>
> End of General Digest, Vol 69, Issue 66
> ***************************************
_______________________________________________
General mailing list
[email protected]
http://xqzone.com/mailman/listinfo/general
