Hi,
In general, the challenge arises when you have a CSV dump whose
records were created by joins across tables, so that a percentage of the
fields are duplicate data. A good example is a sales system where
there is an order header and then a record for each item sold. You
will often see this in CSV where much of the header record has been
appended to each of the "items sold" lines.
So, in this case, a clean XML may theoretically look like:
<order>
<header>
.... stuff from the header record
</header>
<items>
.... all of the items sold as <item> elements
</items>
</order>
This requires post-processing (or linear processing using old-fashioned
control-break routines if the file is ordered properly).
That's the first challenge: know how you want the data represented in
XML format.
The next challenge is the export: how to export the data
and get the same results.
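The control-break processing mentioned above can be sketched in Python as follows. This is only an illustration under assumptions: the column names (order_id, customer, sku, qty), the semicolon delimiter and the sample rows are all hypothetical, and the input must already be sorted by order_id.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical flattened export: header fields are repeated on every item line.
SAMPLE = """order_id;customer;sku;qty
1001;Alice;A-1;2
1001;Alice;B-7;1
1002;Bob;A-1;5
"""

HEADER_FIELDS = ("order_id", "customer")  # assumed header columns
ITEM_FIELDS = ("sku", "qty")              # assumed item columns


def orders_from_csv(text):
    """Yield one <order> element per run of rows sharing the same order_id.

    Classic control break: the input must already be sorted by order_id.
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=";")
    current_key = None
    order = items = None
    for row in reader:
        if row["order_id"] != current_key:  # control break: a new order starts
            if order is not None:
                yield order
            current_key = row["order_id"]
            order = ET.Element("order")
            header = ET.SubElement(order, "header")
            for name in HEADER_FIELDS:
                ET.SubElement(header, name).text = row[name]
            items = ET.SubElement(order, "items")
        item = ET.SubElement(items, "item")
        for name in ITEM_FIELDS:
            ET.SubElement(item, name).text = row[name]
    if order is not None:  # flush the last order
        yield order


orders = list(orders_from_csv(SAMPLE))
for order in orders:
    print(ET.tostring(order, encoding="unicode"))
```

The header is taken from the first row of each group, so duplicate header data on later rows is simply ignored. If the file is not sorted, the rows have to be grouped in a post-processing pass instead.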
----------------------------------------------------------
In your case, the data 'looks' simple, so you could create an XML
record per CSV line with each field being an element. You still need
to decide: one document per line, one document per export, one
document per program, etc.
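As a rough illustration of the "one XML record per CSV line" option, here is a minimal Python sketch. The element names, the <program-entry> wrapper, the semicolon delimiter and the sample row are assumptions made for the example, not something taken from the actual export.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed element names for the columns; the real export may differ.
FIELDS = ["program", "date_from", "date_until", "channel", "usergroup"]

# Hypothetical sample line, semicolon-delimited, no header row.
SAMPLE = "News;2014-02-18T20:00;2014-02-18T20:30;NPO1;adults\n"


def rows_to_documents(text, fields=FIELDS, delimiter=";"):
    """Return one serialized XML document per non-empty CSV row."""
    docs = []
    for values in csv.reader(io.StringIO(text), delimiter=delimiter):
        if not values:  # skip blank lines
            continue
        doc = ET.Element("program-entry")  # hypothetical root element name
        for name, value in zip(fields, values):
            ET.SubElement(doc, name).text = value.strip()
        docs.append(ET.tostring(doc, encoding="unicode"))
    return docs


docs = rows_to_documents(SAMPLE)
print(docs[0])
```

Each string in docs could then be loaded as its own document; wrapping all rows in a single root element instead would give the "one document per export" variant.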
Returning the data:
For your records, if they are as simple as they seem, an XSLT transformation will
do it nicely, or what I like is using maps for this type of stuff.
Lastly, a 'warning' about CSV.
There are different ways of escaping CSV, and in some cases records
that span multiple lines are also acceptable. In all cases, CSV does not care about
escaping characters that are significant in XML ('>', for example). If you use one of the
previously mentioned programs to get your CSV into XML format, just be
aware of what it may do to your data on the way in to make it XML-safe
so that you can account for it on the way out.
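To make the escaping point concrete, here is a small Python sketch of a round trip. It assumes semicolon-delimited CSV with standard double-quote quoting, and the sample line is made up. Other tools may escape differently, so treat this as an illustration of the idea only.

```python
import csv
import io
from xml.sax.saxutils import escape, unescape

# Hypothetical CSV line: the first field contains the delimiter and XML-special characters.
line = '"Tom & Jerry; the <best> show";NPO3\r\n'

# On the way in: CSV quoting is removed by the CSV parser ...
fields = next(csv.reader(io.StringIO(line), delimiter=";"))

# ... and the XML-special characters &, < and > are escaped for XML.
xml_safe = [escape(field) for field in fields]

# On the way out: undo the XML escaping, then let the CSV writer re-quote.
restored = [unescape(field) for field in xml_safe]
out = io.StringIO()
csv.writer(out, delimiter=";").writerow(restored)

print(xml_safe[0])
print(out.getvalue() == line)
```

The two layers of escaping are independent: the CSV layer protects the delimiter and quotes, the XML layer protects the markup characters, and each has to be reversed in the right order on export.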
-David
On 18/02/14 08:30, Erik van der Hoeven wrote:
David,
Those are interesting questions.
At the moment I capture Twitter streams from the Netherlands. Those
Twitter messages are automatically inserted into an ML database (Social DB).
I have placed separate range indexes (on elements, because the
messages are in XML format). Then I created a view so it is easy to get
an overview of all users, messages and locations. But now I want to
integrate it with the Electronic Program Guide (a CSV file). So I
thought that maybe, when I have two views, I can easily integrate and combine them.
The CSV file has the following structure:
Program;Date From, Date Until; Chanel ; #Usergroep
The question is: how can I do this in the most efficient way?
With kind regards,
/Erik van der Hoeven /
/Consultant Business Intelligence/
DIKW CONSULTING BV
Einsteinbaan 12
3439 NJ Nieuwegein
M: 06-43029943
E: [email protected] <mailto:[email protected]>
On Mon, Feb 17, 2014 at 6:49 PM, David Lee <[email protected]
<mailto:[email protected]>> wrote:
Yes, csv2xml and xmlsh are great for this.
The problem with CSV to XML to ML, and with "uber" tools in
general, is that it's more complicated than it looks at first sight.
Technically it's easy, IF you let the developer decide all
the details for you. Which are never what you want.
This is *exactly* the same issue as importing from SQL.
When you want to load CSV into ML, the first thing you should
think about is not how to get CSV into XML,
but how you want your document structure to look.
Does the CSV file become one big XML doc? One doc per row? Do
the values go into attributes? Elements? Both?
Where do you get the names of the values? CSV headers? What if those
are not good XML names (QNames)?
Do you need to merge in different data to de-normalize your docs?
(It is very common for CSV to be part of a package of CSV files,
or for it to contain duplicate rows to represent hierarchies.)
This requires post-processing of the entire result set.
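On the question of headers that are not good XML names, one possible approach is sketched below in Python. It is a deliberately simplified, ASCII-only rule invented for this example (it is not what csv2xml or any other tool does), and the function name to_xml_name is hypothetical.

```python
import re


def to_xml_name(header, fallback="field"):
    """Turn a CSV header into a conservative, ASCII-only XML element name.

    Simplified rule (an assumption for this sketch): keep letters, digits,
    '_', '.', and '-'; replace every other run of characters with '_';
    make sure the name starts with a letter or underscore.
    """
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", header.strip()).strip("_")
    if not name:
        return fallback
    if not re.match(r"[A-Za-z_]", name):
        name = "_" + name
    return name


for header in ["Program", "Date From", " #Usergroep", "2nd channel", "???"]:
    print(repr(header), "->", to_xml_name(header))
```

Two different headers can end up with the same element name under a rule like this, so a real import would also need to detect and resolve such collisions.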
So first, think about how you want your final docs to look. Then
think about how to load them.
Is this a one-off small CSV file? A HUGE file (GB+)? Will it
create millions of docs or tens? How important is it to load fast?
All of these considerations call for different approaches.
OK, you have figured it all out. The tools are all there; you just
need to either pick one that, by amazing grace, chose all
the details exactly how you wanted, or glue something
together to do it your way.
You could load the CSV to the server then do all the transforming
and reloading there,
or you can preprocess it just so and then push it to the server
exactly how you want it.
Both are valid, but I suggest pre-processing the docs is easier
and often faster ...
but it depends on your skills and tools ... and also the sizes of
the data ... and what you're doing with it.
This is what xmlsh excels at. Instead of trying to do one thing
... it lets you split the problem into manageable pieces.
Once you have figured out your document design, you can glue it
together with xmlsh.
1) Get the CSV into SOME kind of XML. Nothing fancy, but something,
so you can use XML tools.
csv2xml has many options to control this; a common one is
csv2xml -header
This will create a single-rooted document <root> with rows <row>
and child elements <element> for each cell, where the tag names
for the cells are created by converting the header columns into QNames.
It's a reasonable start.
Then suppose you want each row turned into its own document -
xsplit to the rescue
http://www.xmlsh.org/CommandXsplit
xsplit is particularly good on this document structure (a root
element with repeating children).
By default it will create files with ugly names like x1.xml, x2.xml.
If you want to rename them based on something in the document you
could then run xmove
http://www.xmlsh.org/CommandXmove
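For readers without xmlsh at hand, the split-then-rename idea can be approximated in plain Python. This is a sketch of the concept only, not a reimplementation of xsplit or xmove; the <root>/<row> layout follows the description above, while the account_id field, the sample data and the function name split_rows are assumptions. It also assumes the key values are safe to use as file names.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Hypothetical input in the <root>/<row> shape described above.
SAMPLE = (
    "<root>"
    "<row><account_id>a1</account_id><name>Alice</name></row>"
    "<row><account_id>a2</account_id><name>Bob</name></row>"
    "</root>"
)


def split_rows(xml_text, out_dir, name_field):
    """Write each <row> to its own file, named after the value of name_field.

    Rows without that field fall back to x1.xml, x2.xml, ... by position.
    """
    root = ET.fromstring(xml_text)
    paths = []
    for index, row in enumerate(root.findall("row"), start=1):
        key = row.findtext(name_field)
        filename = f"{key}.xml" if key else f"x{index}.xml"
        path = os.path.join(out_dir, filename)
        ET.ElementTree(row).write(path, encoding="utf-8", xml_declaration=True)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as tmp:
    written = [os.path.basename(p) for p in split_rows(SAMPLE, tmp, "account_id")]
    print(written)
```

This loads the whole document into memory, which is fine for small files; for very large exports a streaming parser would be the better fit.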
Now the files probably need some tweaking, so you might want to run
an XSLT or XQuery on them to fix them up.
http://www.xmlsh.org/CommandXslt
http://www.xmlsh.org/CommandXquery
Now you have a directory of files ready to upload ...
the put command can do this
http://www.xmlsh.org/MarkLogicPut
Or you can use the excellent tool mlcp
https://developer.marklogic.com/products/mlcp
So the whole process could look as simple as this:
csv2xml < file.csv | ml:put -uri /myfile.xml
or something more realistic:
csv2xml -header < file.csv | xslt -f translate.xsl | xsplit -n -o temp
xmove -x /row/account_id *.xml
ml:put -baseuri /accounts -maxthreads 10 -maxfiles 100 -collection mycollect *.xml
And if you get really fancy you can actually stream this all and
avoid temporary files, but it's a bit trickier.
Anyway ... lots of ways to skin the cat!
*From:* [email protected] *On Behalf Of* Jakob Fix
*Sent:* Monday, February 17, 2014 11:17 AM
*To:* MarkLogic Developer Discussion
*Subject:* Re: [MarkLogic Dev General] csv load
Hi,
http://www.xmlsh.org/CommandCsv2xml (never used it myself, but it
seems to do what you're looking for); note though that you would
have to add a loading task after it which is also available via
xmlsh. I'm sure David Lee can explain this more eloquently.
cheers,
Jakob.
On Mon, Feb 17, 2014 at 5:00 PM, Erik van der Hoeven
<[email protected]
<mailto:[email protected]>> wrote:
Gentlemen,
Does anybody know a way to load a CSV file into a MarkLogic database?
Met vriendelijke groeten/With kind regards,
/Erik van der Hoeven /
/Consultant Business Intelligence/
DIKW CONSULTING BV
Einsteinbaan 12
3439 NJ Nieuwegein
M: 06-43029943
E: [email protected] <mailto:[email protected]>
_______________________________________________
General mailing list
[email protected]
<mailto:[email protected]>
http://developer.marklogic.com/mailman/listinfo/general
--
David Ennis
Content Engineer
HintTech <http://www.hinttech.com>
Mastering the value of content
creative | technology | content
Delftechpark 37i
2628 XJ Delft
The Netherlands
T: +31 88 268 25 00
M: +31 6 000 000 00