Re: [MarkLogic Dev General] csv load

David Ennis Tue, 18 Feb 2014 00:29:31 -0800

Hi.

In general, the challenge comes when you have a dump to CSV where the records were created by joins across tables and a percentage of the fields are duplicate data. A good example is a sales system where there is an order header and then a record for each item sold. You will often see this in CSV, where much of the header record has been appended to each of the "items sold" lines.


So, in this case, a clean XML may theoretically look like:
<order>
<header>
.... stuff from the header record
</header>
<items>
.... all of the items sold as <item> elements
</items>
</order>

This requires post-processing (or linear processing using old-fashioned control-break routines, if the file is ordered properly).
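
A minimal sketch of the control-break idea in Python, not a definitive implementation. The column names (order_id, customer, sku, qty) and the sample rows are hypothetical, and the rows are assumed to be sorted by order_id already. itertools.groupby only merges adjacent rows, which is the control-break behavior described above.

```python
import csv
import io
import xml.etree.ElementTree as ET
from itertools import groupby

# Hypothetical denormalized export: header fields repeated on every item line.
SAMPLE = """order_id,customer,sku,qty
1001,Acme,A-1,2
1001,Acme,B-7,1
1002,Globex,A-1,5
"""

HEADER_FIELDS = ("order_id", "customer")  # assumed header columns
ITEM_FIELDS = ("sku", "qty")              # assumed item columns


def orders_from_csv(text):
    """Yield one <order> element per run of rows sharing the same order_id."""
    rows = csv.DictReader(io.StringIO(text))
    # groupby only merges adjacent rows, so the input must be sorted by order_id.
    for _, group in groupby(rows, key=lambda r: r["order_id"]):
        group = list(group)
        order = ET.Element("order")
        header = ET.SubElement(order, "header")
        for name in HEADER_FIELDS:
            # Header values are taken from the first row of the run.
            ET.SubElement(header, name).text = group[0][name]
        items = ET.SubElement(order, "items")
        for row in group:
            item = ET.SubElement(items, "item")
            for name in ITEM_FIELDS:
                ET.SubElement(item, name).text = row[name]
        yield order


orders = list(orders_from_csv(SAMPLE))
for order in orders:
    print(ET.tostring(order, encoding="unicode"))
```

If the file is not ordered by the grouping key, the rows would have to be sorted first (or collected in a dictionary keyed by order_id), which is the post-processing case.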

That's the challenge: first know how you want the data represented in XML format.

Then the next challenge is what to do for the export: how to export and get the same results.

----------------------------------------------------------

In your case, the data 'looks' simple: you could create one XML record per CSV line, with each field being an element. You then still need to decide: one document per line, one document per export, one document per program, etc.
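
As a rough illustration of the "one document per line" option, here is a small Python sketch. It assumes the file is semicolon-delimited throughout; the element names, sample values, root element name and URI scheme are all made up for the example.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed element names, one per column; the real export may differ.
FIELDS = ("program", "date-from", "date-until", "channel", "usergroup")

# Made-up sample rows, semicolon-delimited.
SAMPLE = """Program A;2014-02-18T20:00;2014-02-18T20:30;One;news
Program B;2014-02-18T22:00;2014-02-18T22:15;Two;sport
"""


def rows_to_documents(text, delimiter=";"):
    """Return a list of (uri, xml_string) pairs, one per CSV line."""
    docs = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for n, row in enumerate(reader, start=1):
        root = ET.Element("program-entry")
        for name, value in zip(FIELDS, row):
            ET.SubElement(root, name).text = value.strip()
        # Hypothetical URI scheme; pick whatever suits the database layout.
        docs.append((f"/epg/{n}.xml", ET.tostring(root, encoding="unicode")))
    return docs


documents = rows_to_documents(SAMPLE)
for uri, xml in documents:
    print(uri, xml)
```

Switching to one document per export would only mean appending every row element to a shared root instead of serializing each one separately.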


Returning the data:

For your records, if they are as simple as they seem, an XSLT transformation will do it nicely. What I like is using maps for this type of stuff.
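
A sketch of the map-based way back to CSV, written in Python, with a plain dict standing in for the map. The element names and the two sample documents are hypothetical; this is only meant to show the shape of the approach.

```python
import csv
import io
import xml.etree.ElementTree as ET

FIELDS = ["program", "channel"]  # hypothetical element names, in column order

# Made-up sample documents, one record each.
DOCS = [
    "<program-entry><program>Program A</program><channel>One</channel></program-entry>",
    "<program-entry><program>B &amp; C</program><channel>Two</channel></program-entry>",
]


def document_to_map(xml_text):
    """Flatten the child elements of one record into a name -> text map."""
    root = ET.fromstring(xml_text)
    return {child.tag: (child.text or "") for child in root}


def export_csv(docs, delimiter=";"):
    """Write one CSV line per document, columns ordered by FIELDS."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter=delimiter,
                            lineterminator="\n", extrasaction="ignore")
    for doc in docs:
        writer.writerow(document_to_map(doc))
    return out.getvalue()


exported = export_csv(DOCS)
print(exported)
```

Keeping the column order in one list (FIELDS here) makes it easier to get the same layout out that came in.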


Lastly, a 'warning' about CSV.

There are different ways of escaping CSV, and in some cases multi-line CSV files are also acceptable. In all cases, CSV does not care about escaping characters that matter for XML ('>', for example). If you use one of the previously mentioned programs to get your CSV into XML format, just be aware of what it may do to your data on the way in to make it XML-safe, so that you can account for it on the way out.
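
To make the warning concrete, here is a small Python sketch using only the standard library (it is not one of the tools mentioned in this thread). The sample record is made up: its second field is quoted, spans two lines, and contains '>' and '&'.

```python
import csv
import io
import xml.etree.ElementTree as ET

# One record whose second field spans two lines and contains XML-special characters.
SAMPLE = 'id,note\n1,"a > b & c\nsecond line"\n'

# The csv module understands the quoted, multi-line field.
row = next(csv.DictReader(io.StringIO(SAMPLE)))

record = ET.Element("row")
for name, value in row.items():
    ET.SubElement(record, name).text = value

# Serializing escapes the special characters as entities.
xml_text = ET.tostring(record, encoding="unicode")
print(xml_text)

# Parsing the XML again restores the original characters.
round_tripped = ET.fromstring(xml_text).findtext("note")
```

A tool that escapes on the way in but is exported with plain string handling on the way out would leave the entities in the CSV, which is the kind of mismatch to watch for.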


-David



On 18/02/14 08:30, Erik van der Hoeven wrote:


David,


Those are interesting questions.

At the moment I capture Twitter streams from the Netherlands. Those Twitter messages are automatically inserted into an ML database (Social DB). I have placed separate range indexes (on elements, because the messages are in XML format). Then I created a view so it is easy to get an overview of all users, messages and locations. But now I want to integrate it with the Electronic Program Guide (a CSV file). So I thought that maybe, when I have two views, I can easily integrate and combine them.



The CSV file consists of the following structure:

Program;Date From, Date Until; Chanel ; #Usergroep


The question is: how can I do this in the most efficient way?



With kind regards,

Erik van der Hoeven
Consultant Business Intelligence

DIKW CONSULTING BV
Einsteinbaan 12
3439 NJ Nieuwegein
M: 06-43029943
E: [email protected] <mailto:[email protected]>

On Mon, Feb 17, 2014 at 6:49 PM, David Lee <[email protected]> wrote:


    Yes csv2xml and xmlsh are great for this.

    The problem with CSV to XML to ML, and with "uber" tools in
    general, is that it's more complicated than it looks at first sight.

    Technically it's easy, IF you let the developer decide all
    the details for you. Which are never what you want.

    This is the same issue *exactly* as importing from SQL.

    When you want to load CSV into ML, the first step you should
    think about is not how to get CSV into XML,
    but how do I want my document structure to look?

    Does the CSV file become one big XML doc ? One doc per row ? do
    the values go into attributes ? elements ? both ?

    Where to get the names of the values ? CSV Headers ? What if those
    are not good  XML names (QNames) ?

    Do you need to merge in different data to de-normalize your docs?
    (It is very common for CSV to be part of a package of CSV files,
    or for it to contain duplicate rows to represent hierarchies.)
    This requires post-processing of the entire result set.

    So first ... think how do you want your final docs to look.  Then
    think ... how to load them.

    Is this a one-off small CSV file ? a HUGE file (GB+) ? will it
    create millions of docs or 10s ... how important is it to load fast ?

    All these are considerations that take different approaches.


    OK, you figured it all out. The tools are all there; you just
    need to either pick one that by amazing grace picked all the
    details for you exactly how you wanted, or glue something
    together to do it your way.

    You could load the CSV to the server then do all the transforming
    and reloading there,

    or you can preprocess it just so and then push it to the server
    exactly how you want it.

    Both are valid, but I suggest pre-processing the docs is easier
    and often faster ...

    but it depends on your skills and tools ... and also the sizes of
    the data ... and what you're doing with it.

    This is what xmlsh excels at.  Instead of trying to do one thing
    ... it lets you split the problem into manageable pieces.

    Once you figured out your document design ... you can glue it
    together with xmlsh

    1)Get the CSV into SOME kind of XML.  Nothing fancy but something
    .... so you can use xml tools.

    csv2xml has many options to control this ... a common one is

    csv2xml -header

    This will create a single-rooted document <root> with rows <row>
    and child elements <element> for each cell, where the tag names
    for the cells are created by converting the header columns into QNames.

    It's a reasonable first start ..

    Then suppose you want each row turned into its own document -
    xsplit to the rescue

    http://www.xmlsh.org/CommandXsplit

    xsplit is particularly good on this document structure (a root
    element with repeating children).

    By default it will create files with ugly names like x1.xml, x2.xml.

    If you want to rename them based on something in the document you
    could then run xmove

    http://www.xmlsh.org/CommandXmove

    Now the files probably need some tweaking, so you might want to run
    an xslt or xquery on them to fix them up

    http://www.xmlsh.org/CommandXslt

    http://www.xmlsh.org/CommandXquery

    Now you have a directory of files ready to upload ...

    the put command can do this

    http://www.xmlsh.org/MarkLogicPut

    Or you can use the excellent tool mlcp

    https://developer.marklogic.com/products/mlcp

    So the whole process could look as simple as this:

    csv2xml < file.csv | ml:put -uri /myfile.xml

    or like something more realistic:

    csv2xml -header < file.csv | xslt -f translate.xsl | xsplit -n -o temp

    xmove -x /row/account_id *.xml

    ml:put -baseuri /accounts -maxthreads 10 -maxfiles 100 -collection mycollect *.xml

    And if you get really fancy you can actually stream this all and
    avoid temporary files, but it's a bit trickier.

    Anyway ... lots of ways to skin the cat!

    *From:* [email protected] *On Behalf Of* Jakob Fix
    *Sent:* Monday, February 17, 2014 11:17 AM


    *To:* MarkLogic Developer Discussion
    *Subject:* Re: [MarkLogic Dev General] csv load

    Hi,

    http://www.xmlsh.org/CommandCsv2xml (never used it myself, but it
    seems to do what you're looking for); note though that you would
    have to add a loading task after it which is also available via
    xmlsh. I'm sure David Lee can explain this more eloquently.


    cheers,
    Jakob.

    On Mon, Feb 17, 2014 at 5:00 PM, Erik van der Hoeven
    <[email protected]> wrote:

    Gentlemen,

    Does anybody know a way to load a CSV file into a MarkLogic database?


    With kind regards,

    Erik van der Hoeven
    Consultant Business Intelligence




    _______________________________________________
    General mailing list
    [email protected]
    <mailto:[email protected]>
    http://developer.marklogic.com/mailman/listinfo/general









--

David Ennis
Content Engineer

HintTech <http://www.hinttech.com>

Delftechpark 37i
2628 XJ Delft
The Netherlands
T:      +31 88 268 25 00

M:      +31 6 000 000 00

