Re: [MarkLogic Dev General] Using Information Studio to split uploaded files..

Geert Josten Tue, 24 Jan 2012 04:48:43 -0800

Hi Dave,



Interesting work! I used xsl:result-document to do something similar, which
went wrong for the same reasons. I posted a comment on the blog describing a
different solution that (essentially) helps both approaches.



I am also investigating the possibilities of splitting in the collector,
which seems to be the most sensible place to do so. But my test case fails.
Perhaps 550k records is a bit too much to start with.. ;)



Kind regards,

Geert



*From:* general-boun...@developer.marklogic.com [mailto:general-boun...@developer.marklogic.com] *On behalf of* Dave Cassel
*Sent:* Tuesday, 24 January 2012 13:35
*To:* General MarkLogic Developer Discussion
*Subject:* Re: [MarkLogic Dev General] Using Information Studio to split
uploaded files..



Late to the party, but I'll put it out there in case it's still helpful:
http://blog.davidcassel.net/2011/06/splitting-data-with-info-studio/



On Jan 17, 2012, at 2:31 PM, Michael Sokolov wrote:



Here's my two cents; I hope it helps with the development of IS.

We have typically been doing this kind of splitting external to
MarkLogic, in XSLT or in Java (with a SAX parser), for very large
documents we need to stream. Generally speaking, an XPath expression can
describe the boundaries where we want to split the original document; often
an element name or a name/attribute-value combination is enough.
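
As a rough illustration of the kind of external streaming split described above, here is a minimal sketch in Python rather than the XSLT or Java mentioned in the message. The function name `split_records` and the sample element names are hypothetical, and the sketch assumes that matching elements do not nest inside one another.

```python
import io
import xml.etree.ElementTree as ET


def split_records(source, tag, attr=None, value=None):
    """Yield each element named `tag` (optionally with attr == value) as XML text.

    Assumption: matching elements are not nested inside each other.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != tag:
            continue
        if attr is not None and elem.get(attr) != value:
            continue
        elem.tail = None  # do not serialize whitespace that follows the element
        yield ET.tostring(elem, encoding="unicode")
        elem.clear()  # release the content of the chunk that was just emitted


# Hypothetical sample input: three records, two of type "a".
SAMPLE = (
    b'<records>'
    b'<record type="a"><id>1</id></record>'
    b'<record type="b"><id>2</id></record>'
    b'<record type="a"><id>3</id></record>'
    b'</records>'
)

all_chunks = list(split_records(io.BytesIO(SAMPLE), "record"))
type_a_chunks = list(split_records(io.BytesIO(SAMPLE), "record", "type", "a"))
```

Here `all_chunks` holds every `record` element, while `type_a_chunks` holds only those whose `type` attribute is `a`, which corresponds to the name/attribute-value case.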

One difficulty in the streaming case has been the need to maintain outer
context when splitting inner elements.  For example, consider a book
document where you want to split on book parts; a book-part can be the
whole book, a part, chapter, section, etc.  In a hierarchical structure
you mostly just want the part you're looking at, but also need to
preserve some outer metadata and/or structure as well - for example you
might like to include the book title in every part of the book so that
you can display that later.  Other typical requirements are to generate
a TOC and to maintain next/previous links between parts.

One approach that has helped us is to generate an intermediate document
including the part wrapped in its ancestors' descendant content *until
the next part boundary*.  Actually in the streaming case you can only
include ancestor descendant content that precedes the current chunk, but
since metadata typically precedes content, that seems to work out OK.
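
A simplified sketch of that idea in Python follows; it is not the actual implementation referred to above. The name `split_with_context` and the sample book are hypothetical. The sketch assumes that parts do not nest, and it treats every non-part element that precedes the current part inside an ancestor as metadata to carry along.

```python
import copy
import io
import xml.etree.ElementTree as ET


def split_with_context(source, part_tag):
    """Yield each part wrapped in copies of its ancestors.

    Each ancestor copy carries the non-part children that precede the part.
    Assumption: parts are not nested inside other parts.
    """
    open_elems = []  # elements whose end tag has not been seen yet, root first
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elems.append(elem)
            continue
        open_elems.pop()  # elem has just been closed
        if elem.tag != part_tag:
            continue
        chunk = copy.deepcopy(elem)
        chunk.tail = None
        on_path = elem
        for ancestor in reversed(open_elems):
            wrapper = ET.Element(ancestor.tag, dict(ancestor.attrib))
            for child in ancestor:
                if child is on_path:
                    break  # only content that precedes the current part
                if child.tag != part_tag:
                    meta = copy.deepcopy(child)
                    meta.tail = None
                    wrapper.append(meta)
            wrapper.append(chunk)
            chunk, on_path = wrapper, ancestor
        yield ET.tostring(chunk, encoding="unicode")
        elem.clear()  # release the content of the part that was just emitted


# Hypothetical sample input: a title followed by two chapters.
BOOK = (
    b'<book><title>Moby-Dick</title>'
    b'<chapter n="1"><p>Call me Ishmael.</p></chapter>'
    b'<chapter n="2"><p>The Carpet-Bag.</p></chapter>'
    b'</book>'
)

parts = list(split_with_context(io.BytesIO(BOOK), "chapter"))
```

Each entry in `parts` is a small `book` document containing the title and exactly one chapter, so the title is available wherever a single part is displayed.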

I would encourage you to consider providing or at least enabling some
solution to this ancestor-metadata problem as a requirement in any
document-splitting pipeline.

-Mike

On 1/13/2012 11:26 AM, Justin Makeig wrote:

Geert,

Information Studio is currently designed for single document in, single
document out transformations. Your best bet for splitting a document today
is to do this as part of the collection step.

Can you tell me a little more about the data you’d like to split? Is it
aggregated XML that you’re splitting on an XPath-like match expression?
Text separated by line breaks? Something else? I’m interested in figuring
out if and how we might make splitting easier and better integrated into
the product.



Justin



Justin Makeig

Senior Product Manager

MarkLogic Corporation

justin.mak...@marklogic.com

Phone: +1 650 655 2387

www.marklogic.com



On Jan 13, 2012, at 6:35 AM, Geert Josten wrote:



Hi,



Is Information Studio intended to allow splitting of uploaded files? If
so, what is the best way of handling that?



I was experimenting with a custom XSLT, and a simple xsl:result-document,
but that is giving funny results. Mostly
http://marklogic.com/states/appservices/distribute-error messages in the
error log. I am not sure what exactly they mean, but I can imagine it is
because CPF handling is 'violated' or something..



Any suggestions?



Kind regards,

Geert



drs. G.P.H. (Geert) Josten

Senior Developer







Dayon B.V.

Delftechpark 37b

2628 XJ Delft



T +31 (0)88 26 82 570



geert.jos...@dayon.nl

www.dayon.nl



The information sent in or with this e-mail message originates from
Dayon BV and is intended solely for the addressee. If you have received this
message unintentionally, we ask that you delete it. No rights can be
derived from this message.

_______________________________________________

General mailing list

General@developer.marklogic.com

http://developer.marklogic.com/mailman/listinfo/general




*David Cassel*

dave.cas...@marklogic.com

Sr. Federal Consultant

MarkLogic Corporation <http://marklogic.com>

