I'm parsing an XML file of about 30 GB and wrote the loop below to process
it line by line. The code cycles through each line and builds a 1x200
DataFrame that is appended to a larger DataFrame. When the larger DataFrame
reaches 1000 rows, I stream it to an SQLite table. The code works for the
first 25 million or so lines (which equates to about 125,000 records in the
SQLite table) and then freezes. I've tried it without the larger DataFrame,
but that didn't help.
Any suggestions on how to avoid this?
Thanks.
Brandon
The XML structure:
<doc>
<field1>value</field1>
<field2>value</field2>
...
</doc>
<doc>
<field1>value</field1>
<field2>value</field2>
...
</doc>
My loop:
f = open("contracts.xml","r")
readline(f)
n = countlines(f)
tic()
for line in eachline(f)
    l = readline(f)
    if startswith(l,"<doc")
        df = DataFrame(df_types,df_names, 1)
    elseif startswith(l,"</doc")
        append!(df1,df)
        if size(df1,1) == 1000
            source = convertdf(df1)
            Data.stream!(source,sink)
            deleterows!(df1,1:1000)
        end
    else
        str = parse_string(l)
        r = root(str)
        df[symbol(name(r))] = string(content(r))
    end
end
close(f)