On 10/11/2013 11:30 AM, Monserrat Foster wrote:
One 30000+ row file and another with just over 200. How much memory should I need for this not to take forever parsing? (I'm currently using my computer as the server, and I can see ruby taking about 1GB in the task manager when processing this, and it takes forever.)

The 30000+ row file is about 7MB, which is not that much (I think).

On Friday, October 11, 2013 8:44:22 AM UTC-4:30, Walter Lee Davis wrote:


    On Oct 10, 2013, at 4:50 PM, Monserrat Foster wrote:

    > A coworker suggested I should use just basic OOP for this, to
    create a class that reads files, and then another to load the
    files into memory. Could you please point me in the right direction
    for this (where can I read about it)? I have no idea what he's
    talking about, as I've never done this before.

    How many of these files are you planning to parse at any one time?
    Do you have the memory on your server to deal with this load? I
    can see this approach working, but getting slow and process-bound
    very quickly. Lots of edge cases to deal with when parsing big
    uploaded files.

    Walter

    >
    > I'll look up nokogiri and SAX
    >
    > On Thursday, October 10, 2013 4:12:33 PM UTC-4:30, Walter Lee
    Davis wrote:
    > On Oct 10, 2013, at 4:36 PM, Monserrat Foster wrote:
    >
    > > Hello, I'm developing an app that basically, receives a 10MB
    or less XLSX files with +30000 rows or so, and another XLSX file
    with about 200rows, I have to read one row of the smallest file,
    look it up on the largest file and write data from both files to a
    new one.
    >
    > Wow. Do you have to do all this in a single request?
    >
    > You may want to look at Nokogiri and its SAX parser. SAX parsers
    don't care about the size of the document they operate on, because
    they work one node at a time, and don't load the whole thing into
    memory at once. There are some limitations on what kind of work a
    SAX parser can perform, because it isn't able to see the entire
    document and "know" where it is within the document at any point.
    But for certain kinds of problems, it can be the only way to go.
    Sounds like you may need something like this.
    >
    > Walter
    >
    > >
    > > I just did a test reading a few rows from the largest file
    using ROO (Spreadsheet doesn't support XLSX and Creek looks good
    but I can't find a way to read row by row)
    > > and it basically made my computer crash, the server crashed, I
    tried rebooting it and it said It was already started, anyway, it
    was a disaster.
    > >
    > > So, my question was, is there a gem that works best with large
    XLSX files or is there another way to approach this without
    crashing my computer?
    > >
    > > This is what I had (It's very possible I'm doing it wrong,
    help is welcome)
    > > What I was trying to do here was to process the files and
    create the new XLS file after both of the XLSX files were uploaded:
    > >
    > >
    > > require 'roo'
    > > require 'spreadsheet'
    > > require 'creek'
    > > class UploadFiles < ActiveRecord::Base
    > >   after_commit :process_files
    > >   attr_accessible :inventory, :material_list
    > >   has_one :inventory
    > >   has_one :material_list
    > >   has_attached_file :inventory,
    > >     :url => "/:current_user/inventory",
    > >     :path => ":rails_root/tmp/users/uploaded_files/inventory/inventory.:extension"
    > >   has_attached_file :material_list,
    > >     :url => "/:current_user/material_list",
    > >     :path => ":rails_root/tmp/users/uploaded_files/material_list/material_list.:extension"
    > >   validates_attachment_presence :material_list
    > >   accepts_nested_attributes_for :material_list, :allow_destroy => true
    > >   accepts_nested_attributes_for :inventory, :allow_destroy => true
    > >   validates_attachment_content_type :inventory,
    > >     :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    > >     :message => "Only .XLSX files are accepted as Inventory"
    > >   validates_attachment_content_type :material_list,
    > >     :content_type => ["application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"],
    > >     :message => "Only .XLSX files are accepted as Material List"
    > >
    > >
    > >   def process_files
    > >     inventory = Creek::Book.new(Rails.root.to_s + "/tmp/users/uploaded_files/inventory/inventory.xlsx")
    > >     material_list = Creek::Book.new(Rails.root.to_s + "/tmp/users/uploaded_files/material_list/material_list.xlsx")
    > >     inventory = inventory.sheets[0]
    > >     scl = Spreadsheet::Workbook.new
    > >     sheet1 = scl.create_worksheet
    > >     inventory.rows.each do |row|
    > >       row.inspect
    > >       sheet1.row(1).push(row)
    > >     end
    > >
    > >     sheet1.name = "Site Configuration List"
    > >     scl.write(Rails.root.to_s + "/tmp/users/generated/siteconfigurationlist.xls")
    > >   end
    > > end
    > >
    > >
    > > --
    > > You received this message because you are subscribed to the
    Google Groups "Ruby on Rails: Talk" group.
    > > To unsubscribe from this group and stop receiving emails from
    it, send an email to [email protected].
    > > To post to this group, send email to
    [email protected].
    > > To view this discussion on the web visit
    
https://groups.google.com/d/msgid/rubyonrails-talk/bc470d4d-19c4-4969-8ba7-4ead7a35d40c%40googlegroups.com
    
<https://groups.google.com/d/msgid/rubyonrails-talk/bc470d4d-19c4-4969-8ba7-4ead7a35d40c%40googlegroups.com>.

    > > For more options, visit
    https://groups.google.com/groups/opt_out
    <https://groups.google.com/groups/opt_out>.

I use a rather indirect route that works fine for me with 15,000 lines and about 26 MB. I export the file from LibreOffice Calc as CSV (comma-separated values). Then, in the Rails controller, I use something like:

require 'csv'

class TheControllerController < ApplicationController # ;')

  # other controller code

  def upload
    # CSV.parse (from Ruby's CSV class) returns an array of rows
    data = CSV.parse(params[:entries].tempfile.read)
    data.each do |line|
      logger.debug "line: #{line.inspect}"
      # Each line is an array of strings containing the columns of one row of the CSV file.
      # I use these data to populate the appropriate db table / Rails model at this point.
    end
  end

end
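
If memory turns out to be the bottleneck, a streaming variant of the loop above might help. This is only a sketch under assumptions not stated in the thread: both spreadsheets have already been exported to CSV, the first column of each file is the lookup key, and the helper names build_lookup and merge_rows are made up for illustration. CSV.foreach reads one row at a time rather than parsing the whole upload at once, and the small file is indexed in a Hash so each row of the big file needs only one lookup.

```ruby
require 'csv'

# Hypothetical helper: index the small file (about 200 rows) by its first column.
def build_lookup(small_csv_path)
  lookup = {}
  CSV.foreach(small_csv_path, skip_blanks: true) do |row|
    lookup[row[0]] = row
  end
  lookup
end

# Hypothetical helper: stream the large file one row at a time and write a
# combined row whenever its first column matches a key from the small file.
# Returns the number of rows written.
def merge_rows(small_csv_path, large_csv_path, output_path)
  lookup = build_lookup(small_csv_path)
  matched = 0
  CSV.open(output_path, 'w') do |out|
    CSV.foreach(large_csv_path, skip_blanks: true) do |row|
      small_row = lookup[row[0]]
      next unless small_row
      # Small-file columns first, then the large-file columns minus the key.
      out << small_row + row.drop(1)
      matched += 1
    end
  end
  matched
end
```

Called as merge_rows('material_list.csv', 'inventory.csv', 'combined.csv') (the file names are placeholders), it would write one combined row per match and return the match count.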

Make sure that your routes.rb points to this:

  match 'the_controller/upload' => 'the_controller#upload'

Then, from your client machine's command line:

curl -F [email protected] localhost:3000/the_controller/upload

Note that 'entries' in the curl command matches the 'entries' in params[:entries] in the controller.

If you want to do this from a Rails GUI form, look at http://guides.rubyonrails.org/form_helpers.html#uploading-files

During testing on my 4-core, 8 GB laptop, processing the really big files takes several minutes. When I have the app on Heroku, this causes a timeout, so I break up the CSV file into multiple sections such that each section takes less than 30 seconds to upload. By leaving a little 'slack' in the size, I have this automated so it occurs in the background while I am doing other work.
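
The splitting step could be sketched roughly like this. It is not the exact script described above: the helper name split_csv, the part file naming, and splitting by a fixed row count (rather than by measured upload time) are all assumptions, and it does not repeat a header row in each part.

```ruby
require 'csv'

# Hypothetical helper: split one CSV file into numbered part files that each
# hold at most rows_per_part rows. Returns the list of part file paths.
def split_csv(input_path, rows_per_part, output_dir)
  part_paths = []
  # Without a block, CSV.foreach returns an enumerator, so only one slice of
  # rows is held in memory at a time.
  CSV.foreach(input_path).each_slice(rows_per_part).with_index(1) do |rows, number|
    path = File.join(output_dir, format('part_%03d.csv', number))
    CSV.open(path, 'w') { |out| rows.each { |row| out << row } }
    part_paths << path
  end
  part_paths
end
```

Each part file could then be uploaded with its own curl call, with rows_per_part chosen small enough to stay under the timeout.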

Hope these suggestions help.

Don Ziesig





