Creek is good; I'd also recommend dullard, a gem that I wrote. Its output format may be more convenient for your case.

https://github.com/thirtyseven/dullard
http://rubygems.org/gems/dullard

-Ted
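(A minimal sketch of reading a sheet with dullard, for comparison; the Workbook/sheets/rows API below is assumed from the gem's README, and the file name is only an example:)

    require 'dullard'

    # Stream rows from the first worksheet; dullard yields each row
    # as an Array of cell values rather than loading the whole file.
    workbook = Dullard::Workbook.new("inventory.xlsx")
    workbook.sheets[0].rows.each do |row|
      puts row.inspect
    end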
On Friday, October 11, 2013 1:35:39 PM UTC-7, Monserrat Foster wrote:
>
> I forgot to say: after it reads all rows and writes the file, it throws
>
>    (600.1ms)  begin transaction
>    (52.0ms)  commit transaction
>   failed to allocate memory
>   Redirected to http://localhost:3000/upload_files/110
>   Completed 406 Not Acceptable in 1207471ms (ActiveRecord: 693.1ms)
>
> On Friday, October 11, 2013 4:03:12 PM UTC-4:30, Monserrat Foster wrote:
>>
>> This is an everyday process; initially maybe a couple of people will be uploading and parsing files at the same time to generate the new one, but eventually it will extend to other people, so...
>>
>> I used a logger, and it does retrieve and save the files using the comparison. But it takes forever, 30 minutes or so, to generate the file. The process starts as soon as the files are uploaded, but it seems to spend most of the time opening the file; once it's open, it takes maybe 5 minutes at most to generate the new file.
>>
>> Do you know where I can find an example of how to read an XLSX file with Nokogiri? I can't seem to find one.
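(A note for anyone reading along: an .xlsx file is a ZIP archive of XML parts, so "reading it with Nokogiri" means unzipping a worksheet and parsing its XML. A rough sketch using the rubyzip gem; xl/worksheets/sheet1.xml is the standard OOXML location for the first sheet, but be aware that string cells usually hold indexes into xl/sharedStrings.xml, which this sketch does not resolve:)

    require 'zip'       # rubyzip
    require 'nokogiri'

    # Pull the first worksheet's XML straight out of the .xlsx archive
    xml = nil
    Zip::File.open("inventory.xlsx") do |zip|
      xml = zip.read("xl/worksheets/sheet1.xml")
    end

    # Each <row> element contains <c> (cell) elements with <v> values
    doc = Nokogiri::XML(xml)
    doc.remove_namespaces!
    doc.xpath("//row").each do |row|
      puts row.xpath("c/v").map(&:text).inspect
    end

(For a 30000+ row sheet you would want Nokogiri's SAX interface instead of building the whole DOM; see the sketch further down.)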
>> On Friday, October 11, 2013 11:12:20 AM UTC-4:30, Walter Lee Davis wrote:
>>>
>>> On Oct 11, 2013, at 11:30 AM, Monserrat Foster wrote:
>>>
>>> > One 30000+ row file and another with just over 200. How much memory should I need for this not to take forever parsing? (I'm currently using my computer as the server, and I can see ruby taking about 1GB in the task manager when processing this, and it takes forever.)
>>> >
>>> > The 30000+ row file is about 7MB, which is not that much (I think).
>>>
>>> I have a collection of 1200 XML files, ranging in size from 3MB to 12MB each (they're books, in TEI encoding), that I parse with Nokogiri on a 2GB Joyent SmartMachine to convert them to XHTML and then on to EPUB. This process takes 17 minutes for the first pass and 24 minutes for the second pass. It does not crash, but the server is unable to do much of anything else while the loop is running.
>>>
>>> My question here was: is this a self-serve web service, or an admin-level (one privileged user, once in a while) type of thing? In my case, there's one admin who adds maybe two or three books per month to the collection, and the 40-minute do-everything loop was used only for development purposes -- it was my test cycle as I checked all of the titles against a validator to ensure that my adjustments to the transcoding process didn't result in invalid code. I would not advise putting something like this live against the world, as the potential for DoS is extremely great. Anything that can pull the kinds of loads you get when you load a huge file into memory and start fiddling with it should not be public!
>>>
>>> Walter
>>>
>>> > On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote:
>>> >
>>> > On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote:
>>> >
>>> > > A coworker suggested I should use just basic OOP for this: create a class that reads files, and then another to load the files into memory. Could you please point me in the right direction for this (where can I read about it)? I have no idea what he's talking about, as I've never done this before.
>>> >
>>> > How many of these files are you planning to parse at any one time? Do you have the memory on your server to deal with this load? I can see this approach working, but getting slow and process-bound very quickly. Lots of edge cases to deal with when parsing big uploaded files.
>>> >
>>> > Walter
>>> >
>>> > > I'll look up nokogiri and SAX
>>> > >
>>> > > On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis wrote:
>>> > > On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote:
>>> > >
>>> > > > Hello, I'm developing an app that, basically, receives a 10MB or smaller XLSX file with 30000+ rows or so, and another XLSX file with about 200 rows. I have to read one row of the smallest file, look it up in the largest file, and write data from both files to a new one.
>>> > >
>>> > > Wow. Do you have to do all this in a single request?
>>> > >
>>> > > You may want to look at Nokogiri and its SAX parser. SAX parsers don't care about the size of the document they operate on, because they work one node at a time and don't load the whole thing into memory at once. There are some limitations on what kind of work a SAX parser can perform, because it isn't able to see the entire document and "know" where it is within the document at any point. But for certain kinds of problems, it can be the only way to go. Sounds like you may need something like this.
>>> > >
>>> > > Walter
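(A bare-bones illustration of the event-driven style Walter describes, applied to a worksheet's XML; the handler only counts <row> elements and prints <v> cell values, and mapping cells back to their columns is deliberately left out:)

    require 'nokogiri'

    # A SAX handler receives parse events one at a time and never
    # holds the whole document tree in memory.
    class RowHandler < Nokogiri::XML::SAX::Document
      attr_reader :row_count

      def initialize
        @row_count = 0
        @in_value  = false
      end

      def start_element(name, attrs = [])
        @row_count += 1 if name == "row"
        @in_value = (name == "v")
      end

      def characters(text)
        print "#{text} " if @in_value   # text of a <v> (cell value) node
      end

      def end_element(name)
        @in_value = false if name == "v"
      end
    end

    handler = RowHandler.new
    Nokogiri::XML::SAX::Parser.new(handler).parse(File.open("sheet1.xml"))
    puts "\nrows seen: #{handler.row_count}"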
>>> > > > I just did a test reading a few rows from the largest file using Roo (Spreadsheet doesn't support XLSX, and Creek looks good, but I can't find a way to read row by row), and it basically made my computer crash. The server crashed; I tried rebooting it and it said it was already started. Anyway, it was a disaster.
>>> > > >
>>> > > > So, my question was: is there a gem that works best with large XLSX files, or is there another way to approach this without crashing my computer?
>>> > > >
>>> > > > This is what I had (it's very possible I'm doing it wrong; help is welcome). What I was trying to do here was to process the files and create the new XLS file after both of the XLSX files were uploaded:
>>> > > >
>>> > > > require 'roo'
>>> > > > require 'spreadsheet'
>>> > > > require 'creek'
>>> > > >
>>> > > > class UploadFiles < ActiveRecord::Base
>>> > > >   after_commit :process_files
>>> > > >   attr_accessible :inventory, :material_list
>>> > > >   has_one :inventory
>>> > > >   has_one :material_list
>>> > > >   has_attached_file :inventory, :url => "/:current_user/inventory",
>>> > > >     :path => ":rails_root/tmp/users/uploaded_files/inventory/inventory.:extension"
>>> > > >   has_attached_file :material_list, :url => "/:current_user/material_list",
>>> > > >     :path => ":rails_root/tmp/users/uploaded_files/material_list/material_list.:extension"
>>> > > >   validates_attachment_presence :material_list
>>> > > >   accepts_nested_attributes_for :material_list, :allow_destroy => true
>>> > > >   accepts_nested_attributes_for :inventory, :allow_destroy => true
>>> > > >   validates_attachment_content_type :inventory,
>>> > > >     :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
>>> > > >     :message => "Only .XLSX files are accepted as Inventory"
>>> > > >   validates_attachment_content_type :material_list,
>>> > > >     :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
>>> > > >     :message => "Only .XLSX files are accepted as Material List"
>>> > > >
>>> > > >   def process_files
>>> > > >     inventory = Creek::Book.new(Rails.root.to_s +
>>> > > >       "/tmp/users/uploaded_files/inventory/inventory.xlsx")
>>> > > >     material_list = Creek::Book.new(Rails.root.to_s +
>>> > > >       "/tmp/users/uploaded_files/material_list/material_list.xlsx")
>>> > > >     sheet = inventory.sheets[0]
>>> > > >     scl = Spreadsheet::Workbook.new
>>> > > >     sheet1 = scl.create_worksheet
>>> > > >     # Creek yields each row as a Hash of cell references to values.
>>> > > >     # Write each streamed row to its own output row; pushing every
>>> > > >     # row onto row(1) builds one enormous in-memory array.
>>> > > >     sheet.rows.each_with_index do |row, i|
>>> > > >       sheet1.row(i).concat(row.values)
>>> > > >     end
>>> > > >     sheet1.name = "Site Configuration List"
>>> > > >     scl.write(Rails.root.to_s +
>>> > > >       "/tmp/users/generated/siteconfigurationlist.xls")
>>> > > >   end
>>> > > > end
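(One more sketch, on the lookup itself: indexing the ~200-row material list in a Hash first means the 30000+ row inventory only has to be streamed once, instead of re-scanning one file for every row of the other. The key column is an assumption here; the code below pretends the first cell of each row is the shared identifier:)

    require 'creek'
    require 'spreadsheet'

    # Index the small file once: key column => full row of values
    materials = {}
    Creek::Book.new("material_list.xlsx").sheets[0].rows.each do |row|
      cells = row.values              # Creek rows are Hashes of "A1" => value
      materials[cells.first] = cells  # assumed: first cell is the lookup key
    end

    # Stream the big file once, writing out the rows that match
    book  = Spreadsheet::Workbook.new
    sheet = book.create_worksheet(:name => "Site Configuration List")
    out   = 0
    Creek::Book.new("inventory.xlsx").sheets[0].rows.each do |row|
      cells = row.values
      match = materials[cells.first]
      next unless match
      sheet.row(out).concat(cells + match)  # data from both files
      out += 1
    end
    book.write("siteconfigurationlist.xls")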

