Re: [MarkLogic Dev General] Processing Large Documents?

Geert Josten Sun, 19 Feb 2012 23:13:04 -0800

Hi Todd,



I know a few tricks that could help getting this done with information
studio. One of which is putting your XQuery in a custom XQuery transform.
But you need to copy things like collection from the input file, and some
other properties as well, to make sure resulting files are treated properly
in the flow.



But first: are you using the database’s fragmentation options to load your
154Gb file?



Kind regards,

Geert



*Van:* [email protected] [mailto:
[email protected]] *Namens *Todd Gochenour
*Verzonden:* maandag 20 februari 2012 2:00
*Aan:* MarkLogic Developer Discussion
*Onderwerp:* [MarkLogic Dev General] Processing Large Documents?



I have a 154Gig file representing a data dump from MySQL that I want to
load into MarkLogic and analyze.



When I use the flow editor to collect/load this file into an empty
database, it takes 33 seconds.



When I add two delete element transforms to the flow the load fails with a
timeout error after several minutes. One was to remove <table_structure/>,
as this schema information isn't necessary for my analysis.  The second
removed elements with empty contents using the *[not(text())] xpath
expression.



I gather from this that the transform phase does not operate on XML files
in a streaming mode.  Does there exist a custom transform that can work on
a stream of data, say by using Saxon's streaming functionality or a StAX
transformation?  I would expect an ETL tool to be able to handle large
files.



After loading this file huge file without the transform into MarkLogic, I
then wrote the following XQuery which when run in the Query Console was
able to delete these elements and perform an element name transformation as
this operation performed in 15 seconds and reduced the 154Gig file to
6Gigs.  This process handles the ETL functionality with great performance.



The original record reads:



 <table_data name="cli">
 <row>
  <field name="id">1</field>
  <field name="org_id">1</field>
 </row>
....



will be transformed into:



 <cli>
    <id>1</id>
    <org_id>1</org_id>
  </cli>
 ...



with this XQuery:



let $doc := element {/*/*/@name)} {
  for $row in /*/*/table_data/row
  return element {$row/../@name} {
    for $field in $row/field[text()]
    return element {$field/@name} {$field/text()}
  }
}
return xdmp:document-insert("{/*/*/@name}.xml", $doc)



My next step in this process is to write a transform which de-normalizes
the SQL tables into nested element structure and thus removing all the
primary/foreign keys which have no semantic purpose other than to identify
relationships.  I'd like to be able to automate this transformation using
the Information Center Flow Editor rather than doing it manually in the
Query Console.

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general

Re: [MarkLogic Dev General] Processing Large Documents?

Reply via email to