[julia-users] Re: Crashing while parsing large XML file

Diego Javier Zea Thu, 28 Jan 2016 10:32:51 -0800

Hi! 

Why are you using


for line in eachline(f)
  l = readline(f)


instead of

for l in eachline(f)


?
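
To illustrate what I mean, here is a minimal sketch. The in-memory buffer and the names io, seen and all_lines are made up for the example and are not from your code. Calling readline(f) inside an eachline(f) loop consumes a second line on every pass, so the loop body only ever sees every other line:

```julia
# Sketch only: an IOBuffer stands in for the real file (assumption).
io = IOBuffer("a\nb\nc\nd\n")

seen = String[]
for line in eachline(io)
    l = readline(io)        # reads one more line on top of the one eachline just read
    push!(seen, chomp(l))   # chomp drops a trailing newline if there is one
end
# seen now holds "b" and "d"; "a" and "c" were read into `line` and never used

# Iterating directly visits every line exactly once.
io = IOBuffer("a\nb\nc\nd\n")
all_lines = String[]
for l in eachline(io)
    push!(all_lines, chomp(l))
end
# all_lines now holds "a", "b", "c" and "d"
```

If your real loop behaves the same way, half of the lines in the file never reach the startswith checks.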

Best

On Thursday, January 28, 2016, 12:42:35 (UTC-3), Brandon Booth wrote:
>
> I'm parsing an XML file that's about 30 GB and wrote the loop below to 
> parse it line by line. My code cycles through each line and builds a 1x200 
> dataframe that is appended to a larger dataframe. When the larger dataframe 
> gets to 1000 rows I stream it to an SQLite table. The code works for the 
> first 25 million or so lines (which equates to 125,000 or so records in the 
> SQLite table) and then freezes. I've tried it without the larger dataframe 
> but that didn't help.
>
> Any suggestions to avoid crashing?
>
> Thanks.
>
> Brandon
>
>
>
> The XML structure:
> <doc>
> <field1>value</field1>
> <field2>value</field2>
> ...
> </doc>
> <doc>
> <field1>value</field1>
> <field2>value</field2>
> ...
> </doc>
>
>
> My loop:
>
> f = open("contracts.xml","r")
> readline(f)
> n = countlines(f)
> tic()
> for line in eachline(f)
>   l = readline(f)
>   if startswith(l,"<doc")
>     df = DataFrame(df_types,df_names, 1)
>   elseif startswith(l,"</doc")
>     append!(df1,df)
>     if size(df1,1) == 1000
>       source = convertdf(df1)
>       Data.stream!(source,sink)
>       deleterows!(df1,1:1000)
>     end
>   else
>     str = parse_string(l)
>     r = root(str)
>     df[symbol(name(r))] = string(content(r))
>   end
> end
>
> close(f)
>
>
