I'm parsing an XML file that's about 30 GB and wrote the loop below to parse 
it line by line. The code cycles through each line and builds a 1x200 
dataframe that is appended to a larger dataframe. When the larger dataframe 
reaches 1000 rows, I stream it to an SQLite table. The code works for the 
first 25 million or so lines (roughly 125,000 records in the SQLite table) 
and then freezes. I've tried it without the larger dataframe, but that 
didn't help.

Any suggestions for avoiding the freeze?

Thanks.

Brandon



The XML structure:
<doc>
<field1>value</field1>
<field2>value</field2>
...
</doc>
<doc>
<field1>value</field1>
<field2>value</field2>
...
</doc>


My loop:

f = open("contracts.xml","r")
readline(f)
n = countlines(f)
tic()
for line in eachline(f)
  l = readline(f)
  if startswith(l,"<doc")
    df = DataFrame(df_types,df_names, 1)
  elseif startswith(l,"</doc")
    append!(df1,df)
    if size(df1,1) == 1000
      source = convertdf(df1)
      Data.stream!(source,sink)
      deleterows!(df1,1:1000)
    end
  else
    str = parse_string(l)
    r = root(str)
    df[symbol(name(r))] = string(content(r))
  end
end

close(f)
