This is my second day working with MarkLogic, having just come back this week from XMLPrague. So everything in my system to date is default configuration, straight out of the box. I have seen the "Fragment Roots" and "Fragment Parents" nodes listed under my database in the configure-database view, but I haven't done enough research yet to know how they should be configured. I'm getting the impression that MarkLogic requires more custom tuning than I've had to do with eXistDB.
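For my own notes, the Admin API route to declaring a fragment root looks something like the sketch below. This is untested; the "Documents" database name and the table_data element are guesses based on my setup and the dump structure, and my understanding is that the big file would need to be reloaded (or reindexed) for the new fragmentation to take effect.

xquery version "1.0-ml";
import module namespace admin = "http://marklogic.com/xdmp/admin"
    at "/MarkLogic/admin.xqy";

(: untested: declare <table_data> (no namespace) as a fragment root on the Documents database :)
let $config := admin:get-configuration()
let $config := admin:database-add-fragment-root(
                 $config,
                 xdmp:database("Documents"),
                 admin:database-fragment-root("", "table_data"))
return admin:save-configuration($config)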
Again, to date I've loaded the large file and done my first transformation using XQuery, and this seems to perform well. Perhaps my next transformation should chunk this single file into multiple files; I assume I can do this with XQuery. It appears that MarkLogic doesn't implement the XQuery Update Facility recommendation but instead has proprietary functions for inserts and updates. What I've found so far is the xdmp:document-insert() function, which I believe I can call multiple times inside a for loop in XQuery. This is how I think I'd like to accomplish the chunking.

As for indexing, with eXistDB I've been accustomed to full-text word searches returning the parent/ancestor element that contains the word, not the fragment root represented by the chunked file. The structure is part of the index, so when the word is found the parent and ancestor elements can be identified immediately. I believe the same is possible with xDB. Does MarkLogic perform in a similar fashion?

In the past I've worked with large documents that were not 'record' oriented but were instead technical documents with a deep hierarchy of chapter/section/subject/paragraphs. Determining the appropriate level at which to chunk such a file is more difficult than in a classic database/table/row/field hierarchy. The "container" nodes in this case carry distinguishing attributes which make them necessary to maintain. I would like to find a system that can handle deep hierarchies without penalizing performance.
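Here is the sort of chunking I have in mind, untested. It assumes the transformed document from the query quoted below was saved as "mydb.xml" (the real URI would come from the database's @name) and that every row element has an <id> child:

for $row in fn:doc("mydb.xml")/*/*
let $table := fn:local-name($row)
let $uri   := fn:concat("/", $table, "/", $row/id, ".xml")
(: one document per row, added to a collection named after its table :)
return xdmp:document-insert($uri, $row, xdmp:default-permissions(), $table)

For millions of rows a single statement like this may hit the same timeout I saw in the flow editor, so it would presumably need to be batched per table or spawned in smaller transactions.

On the search question, my reading of the documentation so far is that a search can be scoped to any element, for example:

cts:search(//paragraph, cts:word-query("metadata"))[1 to 10]

which I expect to return the matching <paragraph> elements rather than whole fragments, but I'd welcome confirmation of how this interacts with fragmentation.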
On Mon, Feb 20, 2012 at 12:12 AM, Geert Josten <[email protected]> wrote:

> Hi Todd,
>
> I know a few tricks that could help getting this done with Information
> Studio. One of them is putting your XQuery in a custom XQuery transform.
> But you need to copy things like the collection from the input file, and
> some other properties as well, to make sure the resulting files are
> treated properly in the flow.
>
> But first: are you using the database's fragmentation options to load
> your 154Gb file?
>
> Kind regards,
> Geert
>
> From: [email protected] [mailto:[email protected]] On Behalf Of Todd Gochenour
> Sent: Monday, February 20, 2012 2:00
> To: MarkLogic Developer Discussion
> Subject: [MarkLogic Dev General] Processing Large Documents?
>
> I have a 154Gig file representing a data dump from MySQL that I want to
> load into MarkLogic and analyze.
>
> When I use the flow editor to collect/load this file into an empty
> database, it takes 33 seconds.
>
> When I add two delete-element transforms to the flow, the load fails with
> a timeout error after several minutes. One transform was to remove
> <table_structure/>, as this schema information isn't necessary for my
> analysis. The second removed elements with empty content using the
> *[not(text())] XPath expression.
>
> I gather from this that the transform phase does not operate on XML files
> in a streaming mode. Is there a custom transform that can work on a
> stream of data, say by using Saxon's streaming functionality or a StAX
> transformation? I would expect an ETL tool to be able to handle large
> files.
>
> After loading this huge file into MarkLogic without the transforms, I
> wrote the following XQuery in the Query Console. It deleted these
> elements and performed the element-name transformation in 15 seconds,
> reducing the 154Gig file to 6Gigs. This process handles the ETL
> functionality with great performance.
>
> The original record reads:
>
> <table_data name="cli">
>   <row>
>     <field name="id">1</field>
>     <field name="org_id">1</field>
>   </row>
>   ....
>
> and will be transformed into:
>
> <cli>
>   <id>1</id>
>   <org_id>1</org_id>
> </cli>
> ...
>
> with this XQuery:
>
> let $doc := element {/*/*/@name} {
>   for $row in /*/*/table_data/row
>   return element {$row/../@name} {
>     for $field in $row/field[text()]
>     return element {$field/@name} {$field/text()}
>   }
> }
> return xdmp:document-insert(concat(/*/*/@name, ".xml"), $doc)
>
> My next step in this process is to write a transform that de-normalizes
> the SQL tables into a nested element structure, removing all the
> primary/foreign keys that have no semantic purpose other than to identify
> relationships. I'd like to be able to automate this transformation using
> the Information Center Flow Editor rather than doing it manually in the
> Query Console.
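P.S. For the de-normalization step, I'm picturing something along these lines (untested; the <org> table is hypothetical, and this assumes per-row documents in "org" and "cli" collections as in the chunking sketch above):

for $org in fn:collection("org")/org
let $id := fn:string($org/id)
return xdmp:document-insert(
  fn:concat("/nested/org/", $id, ".xml"),
  element org {
    $org/*,
    (: pull in the cli rows that point at this org and drop the foreign key :)
    for $cli in fn:collection("cli")/cli[org_id = $id]
    return element cli { $cli/*[fn:not(self::org_id)] }
  })

If the tables are large, I imagine an element range index on the foreign-key element would be needed to make the join perform.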
_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
