Hello MarkLogic users, I have some big XML files, ranging from a few MB to a few hundred MB, and maybe larger. They are big matrices exported from Stata with 1-4k columns and ~20k rows. I need to normalize them and I would like to transform them into something a little more logical.
I can do most of what I need in XQuery. It works; I tried it on two rows. But when I then tried to collect another section of the data, which shouldn't have been a big deal, it locked up. I wrote a SAX parser in C++ with libxml2, and it can parse the whole file in 20 seconds. But I want to have all of my files in MarkLogic and accessible.

So how should I structure the XML I want to import? Each value of the original data has a path that looks like "/dta/data/o/v/@varname", where varname is the unique part defining a column. I would prefer to transform this to "/survey/observations/observation/varname". I think this would make it easier and more efficient to index on just this column and no other. It seems like an index on the attribute would be much less efficient. Plus, attributes have to be accessed via a filter instead of a column-unique path. Is my assumption correct?

It seems like this task hinges on the subtleties of how MarkLogic creates and uses indices, and I don't like relying on subtle knowledge. I will also need to access paths dynamically, and that may require yet another filter, this time on local-name() instead of the attribute name. Still, given the size and structure of these files, some pre-processing in C++ before loading into MarkLogic seems like a good idea.

So what would MarkLogic users suggest? Should I stick with a generic element name like "v", "value", or "column" and rely on attribute value filters? Or should I use a column-unique element name like "dr", "numero", "region", etc. where I can, knowing that I may still need a local-name() filter for dynamic selection? My output queries will return all rows and selected columns, denormalized again, as CSV for now and hopefully Rdata later.
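To make the restructuring concrete, here is a rough sketch of the pre-processing step I have in mind. It is written in Python with the standard library purely for illustration, not in C++/libxml2, and it rests on assumptions: that each value is an element like <v varname="region">3</v> inside an <o> row, that every varname is already a legal XML element name, and that the function name transform is just a placeholder.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def transform(source, sink):
    """Stream rows from /dta/data/o/v[@varname] to
    /survey/observations/observation/<varname>.

    Assumption: each <v> carries its column name in a 'varname'
    attribute and its value as text content.
    """
    sink.write("<survey><observations>")
    # iterparse reports each element once it is complete, so the
    # whole document is never built as one fully populated tree.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "o":
            continue
        sink.write("<observation>")
        for v in elem.iter("v"):
            name = v.get("varname")
            if name is None:
                continue  # skip values that have no column name
            # Assumption: 'name' is a valid element name; real data
            # might need sanitizing before it is used as a tag.
            sink.write("<{0}>{1}</{0}>".format(name, escape(v.text or "")))
        sink.write("</observation>")
        elem.clear()  # drop the row's children to keep memory low
    sink.write("</observations></survey>")


if __name__ == "__main__":
    sample = io.BytesIO(
        b'<dta><data><o><v varname="region">3</v>'
        b'<v varname="dr">12</v></o></data></dta>'
    )
    out = io.StringIO()
    transform(sample, out)
    print(out.getvalue())
```

The point of the sketch is only the shape change: one element per column name instead of a generic "v" with an attribute. Whether that shape actually indexes better in MarkLogic is exactly what I am asking.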
Thanks,

John Daniel
Geospatial Software Architect
Institute for Health Metrics and Evaluation | University of Washington
2301 5th Avenue, Suite 600 | Seattle, WA 98121
Tel: +1-206-897-2862 | UW Campus Mailbox: 358210
jwdan...@uw.edu | http://www.healthdata.org