Re: [Rails] What's the best way to approach reading and parse large XLSX files?

Walter Lee Davis Fri, 11 Oct 2013 06:15:34 -0700

On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote:

> A coworker suggested I should just use basic OOP for this: create a class 
> that reads the files, and then another that loads them into memory. Could you 
> please point me in the right direction (where can I read about it)? 
> I have no idea what he's talking about, as I've never done this before.


How many of these files are you planning to parse at any one time? Do you have 
the memory on your server to deal with this load? I can see this approach 
working, but it will get slow and process-bound very quickly. There are lots of 
edge cases to deal with when parsing big uploaded files.
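
One way to keep the memory load down, sketched here under assumptions: hold only the small (about 200 rows) file in memory as a hash keyed on the lookup column, and walk the large file one row at a time. The class names MaterialIndex and InventoryMatcher, the sample rows, and the choice of the first column as the lookup key are all hypothetical; plain arrays stand in for whatever row enumerator your XLSX reader gives you.

```ruby
# Holds the small file in memory, indexed by its first column (assumed key).
class MaterialIndex
  def initialize(rows)
    @by_key = {}
    rows.each { |row| @by_key[row[0]] = row }
  end

  def lookup(key)
    @by_key[key]
  end
end

# Walks the large file once, row by row, and yields a merged row
# whenever the row's first column is found in the index.
class InventoryMatcher
  def initialize(index)
    @index = index
  end

  def each_match(inventory_rows)
    return enum_for(:each_match, inventory_rows) unless block_given?

    inventory_rows.each do |row|
      material = @index.lookup(row[0])
      # Merge: material columns first, then the inventory columns after the key.
      yield material + row.drop(1) if material
    end
  end
end

# Hypothetical sample data standing in for the two spreadsheets.
material_rows  = [["P-100", "Bolt"], ["P-300", "Washer"]]
inventory_rows = [["P-100", 12], ["P-200", 7], ["P-300", 0]].each # an Enumerator

index   = MaterialIndex.new(material_rows)
matches = InventoryMatcher.new(index).each_match(inventory_rows).to_a
# => [["P-100", "Bolt", 12], ["P-300", "Washer", 0]]
```

Because the matcher only needs something that responds to each, the large file is read in a single pass and never has to sit in memory whole; only the small index does.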

Walter

> 
> I'll look up nokogiri and SAX
> 
> On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee Davis wrote:
> On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote: 
> 
> > Hello, I'm developing an app that basically receives an XLSX file of 
> > 10 MB or less with 30,000+ rows or so, and another XLSX file with about 
> > 200 rows. I have to read one row of the smaller file, look it up in the 
> > larger file, and write data from both files to a new one. 
> 
> Wow. Do you have to do all this in a single request? 
> 
> You may want to look at Nokogiri and its SAX parser. SAX parsers don't care 
> about the size of the document they operate on, because they work one node at 
> a time, and don't load the whole thing into memory at once. There are some 
> limitations on what kind of work a SAX parser can perform, because it isn't 
> able to see the entire document and "know" where it is within the document at 
> any point. But for certain kinds of problems, it can be the only way to go. 
> Sounds like you may need something like this. 
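
A minimal sketch of that callback style, for illustration only. It uses REXML's stream listener (bundled with Ruby) rather than Nokogiri so that it runs without installing a gem; Nokogiri::XML::SAX::Document exposes the analogous start_element, end_element, and characters callbacks. The RowListener class and the simplified worksheet XML are hypothetical, and real XLSX sheets usually keep strings in a separate shared-strings part, which this sketch ignores.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Receives parser events one node at a time and hands each finished
# row to a block, so only the current row is ever held in memory.
class RowListener
  include REXML::StreamListener

  def initialize(&on_row)
    @on_row = on_row # called with an Array of cell values per <row>
    @cells  = nil    # values collected for the row being read
    @value  = nil    # text buffer while inside a <v> element
  end

  def tag_start(name, _attrs)
    case name
    when 'row' then @cells = []
    when 'v'   then @value = +''
    end
  end

  def text(data)
    # Ignore whitespace and text outside of <v> elements.
    @value << data if @value
  end

  def tag_end(name)
    case name
    when 'v'
      @cells << @value
      @value = nil
    when 'row'
      @on_row.call(@cells)
      @cells = nil
    end
  end
end

# Hypothetical, heavily simplified worksheet XML.
SHEET_XML = <<~XML
  <worksheet>
    <sheetData>
      <row r="1"><c r="A1"><v>1001</v></c><c r="B1"><v>25</v></c></row>
      <row r="2"><c r="A2"><v>1002</v></c><c r="B2"><v>40</v></c></row>
    </sheetData>
  </worksheet>
XML

rows = []
REXML::Document.parse_stream(SHEET_XML, RowListener.new { |cells| rows << cells })
# rows => [["1001", "25"], ["1002", "40"]]
```

Note the limitation mentioned above: the listener has to track for itself whether it is inside a row or a value, because the parser never gives it the whole document to look around in.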
> 
> Walter 
> 
> > 
> > I just did a test reading a few rows from the largest file using Roo 
> > (Spreadsheet doesn't support XLSX, and Creek looks good but I can't find a 
> > way to read row by row), 
> > and it basically made my computer crash. The server crashed, I tried 
> > rebooting it and it said it was already started. Anyway, it was a disaster. 
> > 
> > So, my question is: is there a gem that works best with large XLSX files, or 
> > is there another way to approach this without crashing my computer? 
> > 
> > This is what I had (it's very possible I'm doing it wrong, help is welcome). 
> > What I was trying to do here was to process the files and create the new 
> > XLS file after both of the XLSX files were uploaded: 
> > 
> > 
> > require 'roo' 
> > require 'spreadsheet' 
> > require 'creek' 
> > class UploadFiles < ActiveRecord::Base 
> >   after_commit :process_files 
> >   attr_accessible :inventory, :material_list 
> >   has_one :inventory 
> >   has_one :material_list 
> >   has_attached_file :inventory, :url => "/:current_user/inventory", 
> >     :path => ":rails_root/tmp/users/uploaded_files/inventory/inventory.:extension" 
> >   has_attached_file :material_list, :url => "/:current_user/material_list", 
> >     :path => ":rails_root/tmp/users/uploaded_files/material_list/material_list.:extension" 
> >   validates_attachment_presence :material_list 
> >   accepts_nested_attributes_for :material_list, :allow_destroy => true 
> >   accepts_nested_attributes_for :inventory, :allow_destroy => true 
> >   validates_attachment_content_type :inventory, 
> >     :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"], 
> >     :message => "Only .XLSX files are accepted as Inventory" 
> >   validates_attachment_content_type :material_list, 
> >     :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"], 
> >     :message => "Only .XLSX files are accepted as Material List" 
> >   
> >   
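> >   # Copies every inventory row into a new XLS workbook. 
> >   def process_files 
> >     inventory = Creek::Book.new(Rails.root.to_s + 
> >       "/tmp/users/uploaded_files/inventory/inventory.xlsx") 
> >     # Opened here but not used yet; the lookup step is still to be written. 
> >     material_list = Creek::Book.new(Rails.root.to_s + 
> >       "/tmp/users/uploaded_files/material_list/material_list.xlsx") 
> >     inventory = inventory.sheets[0] 
> >     scl = Spreadsheet::Workbook.new 
> >     sheet1 = scl.create_worksheet 
> >     # Creek yields each row as a hash of cell reference => value, so write 
> >     # the values to their own output row instead of piling them into row 1. 
> >     inventory.rows.each_with_index do |row, index| 
> >       sheet1.row(index).push(*row.values) 
> >     end 
> >     
> >     sheet1.name = "Site Configuration List" 
> >     scl.write(Rails.root.to_s + 
> >       "/tmp/users/generated/siteconfigurationlist.xls") 
> >   end 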
> > end 
> > 
> > 
> > -- 
> > You received this message because you are subscribed to the Google Groups 
> > "Ruby on Rails: Talk" group. 
> > To unsubscribe from this group and stop receiving emails from it, send an 
> > email to rubyonrails-ta...@googlegroups.com. 
> > To post to this group, send email to rubyonra...@googlegroups.com. 
> > To view this discussion on the web visit 
> > https://groups.google.com/d/msgid/rubyonrails-talk/bc470d4d-19c4-4969-8ba7-4ead7a35d40c%40googlegroups.com.
> >  
> > For more options, visit https://groups.google.com/groups/opt_out. 
> 
> 

