Hi Todd,


You can load a 154 GB file into MarkLogic (apparently), but if you start
working concurrently on that file, things are likely to become
complicated.



It is true that MarkLogic still only supports XQuery 1.0 syntax, though
there are a few interesting additions borrowed from XQuery 3.0 (like
try/catch) when you use the xquery version "1.0-ml" declaration. For
updating you will need to use xdmp:document-insert (to insert or update
entire documents), or xdmp:node-replace and related functions (for
mutations within a document). Using xdmp:document-insert within a FLWOR
to chunk the large file is indeed the best way to do it if you want to
stay within MarkLogic. Similarly, you will need the cts functions, or
the search:search library, for advanced full-text searching.
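
As a rough illustration, here is a minimal sketch of that chunking
pattern. The "/dump.xml" URI and the table_data/row structure are
assumptions based on the MySQL dump sample quoted further down in this
thread, so adjust the paths to your actual data:

xquery version "1.0-ml";

(: Sketch: chunk one large loaded document into one small document per
   table row, using xdmp:document-insert inside a FLWOR. :)
for $row at $i in doc("/dump.xml")/*/*/table_data/row
let $table := fn:string($row/../@name)
return
  xdmp:document-insert(
    fn:concat("/", $table, "/", $i, ".xml"),
    element { $table } {
      for $field in $row/field[text()]
      return element { $field/@name } { $field/text() }
    },
    xdmp:default-permissions(),
    $table  (: put each chunk in a collection named after its table :)
  )

For a file of this size you would probably want to run something like
this in batches, or spawn it onto the task server, to stay within
transaction limits.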



About searching: if you use an expression like
cts:search(collection()//title, "xquery"), it will search for the word
'xquery' within every title, and return all title elements containing
that word. But indexes are based on fragments, so if there are many
titles within one fragment, performance will degrade. This is because
each fragment containing such titles has to be retrieved from the
database and parsed/filtered to return only the positive matches. Not
having to retrieve and parse those fragments at all will always perform
best by far. So the best strategy within MarkLogic is to select fragment
roots in the search expression, and to use cts:highlight to isolate the
matching words (but only where really necessary, so only for the results
you actually want to show). Doing so will make sure searching scales
well up to and (far) beyond millions of records.
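
To sketch the idea (the title element comes from the example above; the
page size of 10 is just an assumption for illustration):

xquery version "1.0-ml";

(: Sketch: search against fragment roots (whole documents) rather than
   against descendant title elements, and only highlight the hits you
   actually show. :)
let $query := cts:element-word-query(xs:QName("title"), "xquery")
for $doc in cts:search(fn:collection(), $query)[1 to 10]  (: first page only :)
return
  for $title in $doc//title  (: descend only into documents being displayed :)
  return cts:highlight($title, $query, <b>{ $cts:text }</b>)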



For instance: my Mark my Tweet demojam demo had roughly half a million
docs in the database, and was able to do at least 200 complex searches
for customized facet calculations, plus sorted/paged search results,
within a few seconds. At the time of the demojam, the number of docs was
at least twice what it had been when I left home for the conference a
few days earlier, and the speed difference was hardly noticeable. You
could also take a look at http://markmail.org, which contains more than
50 million docs.



MarkLogic is a specialized tool; it is not an ordinary XML database. You
can get a LOT of performance out of MarkLogic, but you will indeed need
to tune your stored data in a few ways to match what you are trying to
achieve. I think it can handle hierarchical structures just as well, but
you might consider chunking them into logical search/edit fragments and
preserving the hierarchy through a directory structure, or by linking. I
have seen plenty of examples of MarkLogic doing so, and performing well
at it.
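
For example (hypothetical element names and ids, purely to show the
directory idea):

xquery version "1.0-ml";

(: Sketch: chunk a deep document into subject-level fragments while
   preserving the hierarchy in directory-style URIs. The document URI
   and the chapter/section/subject structure are hypothetical. :)
for $subject in doc("/manual.xml")/book/chapter/section/subject
let $uri := fn:concat(
  "/book/chapter-", fn:string($subject/ancestor::chapter/@id),
  "/section-", fn:string($subject/ancestor::section/@id),
  "/subject-", fn:string($subject/@id), ".xml")
return xdmp:document-insert($uri, $subject)

Everything below a chapter can then be pulled back with xdmp:directory,
or searches can be constrained to it with cts:directory-query.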



Kind regards,

Geert



*From:* [email protected] [mailto:
[email protected]] *On behalf of* Todd Gochenour
*Sent:* Monday, February 20, 2012 8:47 AM
*To:* MarkLogic Developer Discussion
*Subject:* Re: [MarkLogic Dev General] Processing Large Documents?



This is my second day working with MarkLogic, having just come back this
week from XMLPrague.  So everything in my system to date is the default
configuration, straight out of the box.  I have seen the "Fragment Roots"
and "Fragment Parents" nodes listed under my database in the configure
database view.  I've not done enough research yet to know how these
should be configured.  I'm getting the impression that MarkLogic requires
more custom tuning than I've had to do with eXistDB.



Again, to date I've loaded the large file and done my first transformation
using XQuery and this seems to perform well.  Perhaps my next
transformation should chunk this single file into multiple files.  I assume
I can do this with XQuery.   It appears that MarkLogic doesn't implement
the XQuery Update Facility recommendation but instead has proprietary
functions to accomplish inserts and updates.  What I've found so far is the
xdmp:document-insert() function, which I believe I can call multiple times
within a for iterator within XQuery.  This is how I think I'd like to
accomplish the chunking.



As for indexing, with eXistDB I've been accustomed, when doing full-text
searches for words, to receiving results that return the parent/ancestor
element containing the word, not the fragment root represented by the
chunked file.  The structure is part of the index, so when a word is
found, the parent and ancestor elements can be immediately identified.  I
believe the same is possible with xDB.  Does MarkLogic perform in a
similar fashion?



In the past I've worked with large documents that were not 'record'
oriented but were instead technical documents with a deep hierarchy
consisting of chapter/section/subject/paragraphs.  Determining the
appropriate level at which to chunk such a file is more difficult than
with a classic database/table/row/field hierarchy.  The "container" nodes
in this case actually carry distinguishing attributes which make them
necessary to maintain.



I would like to find a system that can handle deep hierarchies
without penalizing performance.

On Mon, Feb 20, 2012 at 12:12 AM, Geert Josten <[email protected]>
wrote:

Hi Todd,



I know a few tricks that could help get this done with Information
Studio. One of them is putting your XQuery in a custom XQuery transform.
But you need to copy things like the collection from the input file, and
some other properties as well, to make sure the resulting files are
treated properly in the flow.



But first: are you using the database's fragmentation options to load
your 154 GB file?



Kind regards,

Geert



*From:* [email protected] [mailto:
[email protected]] *On behalf of* Todd Gochenour
*Sent:* Monday, February 20, 2012 2:00 AM
*To:* MarkLogic Developer Discussion
*Subject:* [MarkLogic Dev General] Processing Large Documents?



I have a 154 GB file representing a data dump from MySQL that I want to
load into MarkLogic and analyze.



When I use the flow editor to collect/load this file into an empty
database, it takes 33 seconds.



When I add two delete-element transforms to the flow, the load fails with
a timeout error after several minutes.  One was to remove
<table_structure/>, as this schema information isn't necessary for my
analysis.  The second removed elements with empty contents using the
*[not(text())] XPath expression.
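
For reference, the intent of those two transforms expressed directly as
XQuery (just a sketch; "/dump.xml" is a placeholder URI, and it drops
completely empty elements, which is slightly stricter than the
*[not(text())] test):

xquery version "1.0-ml";

(: Sketch: drop <table_structure/> and drop elements with no content,
   copying everything else through recursively. :)
declare function local:strip($node as node()) as node()?
{
  typeswitch ($node)
    case element(table_structure) return ()   (: schema info not needed :)
    case element() return
      if (fn:empty($node/node())) then ()     (: drop empty elements :)
      else element { fn:node-name($node) } {
        $node/@*,
        for $child in $node/node()
        return local:strip($child)
      }
    default return $node
};

local:strip(doc("/dump.xml")/*)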



I gather from this that the transform phase does not operate on XML files
in a streaming mode.  Does there exist a custom transform that can work on
a stream of data, say by using Saxon's streaming functionality or a StAX
transformation?  I would expect an ETL tool to be able to handle large
files.



After loading this huge file into MarkLogic without the transforms, I
then wrote the following XQuery, which, when run in the Query Console,
was able to delete these elements and perform an element-name
transformation.  The operation ran in 15 seconds and reduced the 154 GB
file to 6 GB.  This process handles the ETL functionality with great
performance.



The original record reads:



 <table_data name="cli">
   <row>
     <field name="id">1</field>
     <field name="org_id">1</field>
   </row>
....



will be transformed into:



 <cli>
   <id>1</id>
   <org_id>1</org_id>
 </cli>
 ...



with this XQuery:



let $doc := element {/*/*/@name} {
  for $row in /*/*/table_data/row
  return element {$row/../@name} {      (: element named after the table :)
    for $field in $row/field[text()]    (: skip empty fields :)
    return element {$field/@name} {$field/text()}
  }
}
return xdmp:document-insert(fn:concat(/*/*/@name, ".xml"), $doc)



My next step in this process is to write a transform which de-normalizes
the SQL tables into a nested element structure, thus removing all the
primary/foreign keys, which have no semantic purpose other than to
identify relationships.  I'd like to be able to automate this
transformation using the Information Center Flow Editor rather than doing
it manually in the Query Console.
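
Something along these lines is what I have in mind (the "/mydb.xml" URI
and the <org> table are placeholders inferred from org_id; only the
<cli> structure and the id/org_id fields come from the sample above):

xquery version "1.0-ml";

(: Sketch: de-normalize two tables by joining on the foreign key, then
   drop the key fields that only express the relationship. :)
let $db := doc("/mydb.xml")/*
for $org in $db/org
return
  element org {
    $org/* except $org/id,
    for $cli in $db/cli[org_id = $org/id]
    return element cli { $cli/* except ($cli/id, $cli/org_id) }
  }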


_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general