[MarkLogic Dev General] How to structure big XML files for fastest access?

John Daniel Fri, 06 Jan 2017 13:40:30 -0800

Hello MarkLogic users,
I have some big XML files, ranging from a few MB to a few hundred MB, and maybe 
larger. They are big matrices exported from Stata with 1-4k columns and ~20k 
rows. I need to normalize them and I would like to transform them into 
something a little more logical.


I can do most of what I need in XQuery. It works. I tried it on two rows. But 
then when I tried to collect another section of data that shouldn’t have been a 
big deal, it locked up. I wrote a SAX parser in C++ with libxml2 and it can 
parse the whole file in 20 seconds. But I want to have all of my files in 
MarkLogic and accessible. So how should I structure the XML I want to import?

Each value of the original data has a path that looks like 
"/dta/data/o/v/@varname", where varname is the unique part defining a column. I 
would prefer to transform this to "/survey/observations/observation/varname". I 
think that this would make it easier and more efficient to index just on this 
column, and not any other. It seems like an index on the attribute would be 
much less efficient. Plus, attributes have to be accessed via a filter instead 
of a column-unique path. Is my assumption correct?
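
To make the target structure concrete, here is a minimal Python sketch of the restructuring described above (Python rather than the C++/libxml2 parser, purely for brevity). It assumes the input really has the shape /dta/data/o/v with a varname attribute, that every varname value is a legal XML element name, and that rows can be handled one at a time. The function name restructure and the sample values are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def restructure(source, sink):
    """Rewrite /dta/data/o/v[@varname] rows as
    /survey/observations/observation/<varname> elements, one row at a time."""
    sink.write("<survey><observations>")
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "o":  # one <o> is one row of the matrix
            sink.write("<observation>")
            for v in elem.findall("v"):
                # Assumption: varname is usable as an element name as-is.
                name = v.get("varname")
                sink.write("<%s>%s</%s>" % (name, escape(v.text or ""), name))
            sink.write("</observation>")
            elem.clear()  # drop the row's children to keep memory use low
    sink.write("</observations></survey>")


if __name__ == "__main__":
    # Tiny made-up sample in the assumed input shape.
    sample = io.BytesIO(
        b'<dta><data><o><v varname="dr">1</v>'
        b'<v varname="region">North</v></o></data></dta>'
    )
    out = io.StringIO()
    restructure(sample, out)
    print(out.getvalue())
```

With this layout each column has its own element name, so a path such as /survey/observations/observation/region addresses one column directly instead of filtering v elements by attribute value.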

It seems like this task hinges on the subtleties of how MarkLogic creates and 
uses indices. I don’t like relying on subtle knowledge. But then, I will need 
to access paths dynamically and that may require yet another filter, this time 
on local-name() instead of the attribute name. But due to the size and 
structure of these files, I think some pre-processing in C++ before MarkLogic 
really seems like a good idea. So what would MarkLogic users suggest? Should I 
stick with a generic element name like “v”, “value”, or “column” and rely on 
attribute value filters? Or should I use a column-unique element name like 
“dr”, “numero”, “region”, etc. where I can, knowing that I may still need a 
local-name() filter for dynamic selection?

My output queries will return all rows and select columns, denormalized again, 
in CSV for now and hopefully Rdata later.
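
The output could look roughly like the following Python sketch, which streams a file in the proposed /survey/observations/observation/varname layout and writes all rows with only the selected columns as CSV. It runs outside MarkLogic and is only meant to show the shape of the result. The function name observations_to_csv is hypothetical, and a column missing from a row is assumed to become an empty field.

```python
import csv
import io
import xml.etree.ElementTree as ET


def observations_to_csv(source, columns, sink):
    """Write every <observation> as one CSV row containing only `columns`."""
    writer = csv.writer(sink, lineterminator="\n")
    writer.writerow(columns)  # header row
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "observation":
            # Columns are chosen dynamically by element name;
            # a missing column yields an empty field.
            writer.writerow([elem.findtext(c, default="") for c in columns])
            elem.clear()  # drop the row's children to keep memory use low


if __name__ == "__main__":
    # Tiny made-up sample in the proposed layout.
    sample = io.BytesIO(
        b"<survey><observations>"
        b"<observation><dr>1</dr><numero>7</numero><region>North</region></observation>"
        b"</observations></survey>"
    )
    out = io.StringIO()
    observations_to_csv(sample, ["dr", "region"], out)
    print(out.getvalue())
```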

Thanks


John Daniel
Geospatial Software Architect
Institute for Health Metrics and Evaluation | University of Washington
2301 5th Avenue, Suite 600 | Seattle, WA 98121
Tel: +1-206-897-2862 | UW Campus Mailbox: 358210
jwdan...@uw.edu | http://www.healthdata.org

