Hi,

To finally cope with bulk load issues on MySQL (lost connections, etc.),
I've added the ability to split the file into chunks. It works this
way:

post_process :bulk_import, { :file => bulk_file, :columns => target_fields,
  :field_separator => ',', :target => CONFIG, :table => table,
  :rows_per_chunk => 10000 }

rows_per_chunk defaults to false, which does not split the file at
all (current behaviour).
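
To illustrate the intended semantics, here is a minimal in-memory
sketch. The helper name chunked is hypothetical and is not part of the
patch; it only shows how rows are grouped for a given rows_per_chunk.

```ruby
# Hypothetical helper (not in the patch): group rows the way
# rows_per_chunk is meant to behave.
def chunked(rows, rows_per_chunk = false)
  # false/nil => no splitting, everything goes in a single chunk
  return [rows] unless rows_per_chunk
  rows.each_slice(rows_per_chunk).to_a
end

rows = (1..25).map { |i| "row #{i}" }
sizes_split   = chunked(rows, 10).map(&:size) # chunks of 10, 10 and 5 rows
sizes_unsplit = chunked(rows).map(&:size)     # one chunk holding all 25 rows
```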

Is this interesting to others, and should I commit it? Any comments or
remarks on naming or behaviour? I'm pretty sure the code can be
simplified (first version of the patch below, if you care about the
implementation details).

cheers
-- Thibaut


@@ -21,6 +21,10 @@
       attr_accessor :field_enclosure
       # The line separator (defaults to a newline)
       attr_accessor :line_separator
+      # How many rows should be sent at a time (defaults to false => all rows in one chunk)
+      attr_accessor :rows_per_chunk
+      # Chunk file name (defaults to file + '.chunk' )
+      attr_accessor :chunk_file

       # Initialize the processor.
       #
@@ -33,7 +37,9 @@
       #   the bulk data file
       # * <tt>:field_separator</tt>: The field separator. Defaults to a comma
       # * <tt>:line_separator</tt>: The line separator. Defaults to a newline
-      # * <tt>:field_enclosure</tt>: The field enclosure charcaters
+      # * <tt>:field_enclosure</tt>: The field enclosure characters
+      # * <tt>:rows_per_chunk</tt>: How many rows should be sent at a time (defaults to false => all rows in one chunk)
+      # * <tt>:chunk_file</tt>: The chunk file name (defaults to file + '.chunk' ), when using rows_per_chunk
       def initialize(control, configuration)
         super
         @file = File.join(File.dirname(control.file), configuration[:file])
@@ -44,7 +50,8 @@
         @field_separator = (configuration[:field_separator] || ',')
         @line_separator = (configuration[:line_separator] || "\n")
         @field_enclosure = configuration[:field_enclosure]
-
+        @rows_per_chunk = (configuration[:rows_per_chunk] || false)
+        @chunk_file = (configuration[:chunk_file] || (@file + '.chunk' ))
         raise ControlError, "Target must be specified" unless @target
         raise ControlError, "Table must be specified" unless @table
       end
@@ -65,10 +72,34 @@
             options[:fields][:enclosed_by] = field_enclosure if field_enclosure
             options[:fields][:terminated_by] = line_separator if line_separator
           end
-          conn.bulk_load(file, table_name, options)
+          split_into_chunks(file,rows_per_chunk) do |new_file,rows_count|
+            puts "Bulk loading #{rows_count} rows..."
+            conn.bulk_load(new_file, table_name, options)
+          end
         end
       end
-
+
+      # Split the file into rows_per_chunk, yield a temporary chunk filename each time
+      def split_into_chunks(filename,rows_per_chunk)
+        if rows_per_chunk
+          File.open(filename) do |input|
+            until input.eof?
+              rows_count = 0
+              File.open(chunk_file,'w') do |chunk|
+                loop do
+                  chunk << input.gets
+                  rows_count += 1
+                  break if (input.lineno % rows_per_chunk == 0) || (input.eof?)
+                end
+              end
+              yield chunk_file,rows_count
+            end
+          end
+        else
+          yield filename
+        end
+      end
+
       def table_name
         ETL::Engine.table(table, ETL::Engine.connection(target))
       end
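
As one possible simplification of split_into_chunks, here is a
standalone sketch built on File.foreach and each_slice. The method name
each_chunk and its signature are hypothetical, and like the patch it
assumes rows end with a newline rather than honouring line_separator.
Unlike the first version of the patch, it also passes a row count when
the file is not split, at the cost of one extra read of the file.

```ruby
require 'tempfile'

# Hypothetical standalone variant of split_into_chunks (illustrative names).
# Yields [path, rows_count] once per chunk. With rows_per_chunk false/nil
# it yields the original file a single time, along with its line count.
def each_chunk(filename, rows_per_chunk, chunk_file = filename + '.chunk')
  unless rows_per_chunk
    yield filename, File.foreach(filename).count
    return
  end
  File.foreach(filename).each_slice(rows_per_chunk) do |lines|
    # Overwrite the same chunk file each time, as the patch does.
    File.write(chunk_file, lines.join)
    yield chunk_file, lines.size
  end
end

# Usage: a 25-row file split into chunks of at most 10 rows.
Tempfile.create('bulk') do |f|
  f.write((1..25).map { |i| "#{i},name#{i}\n" }.join)
  f.flush
  each_chunk(f.path, 10) do |path, count|
    puts "Bulk loading #{count} rows from #{path}..."
  end
  File.delete(f.path + '.chunk') # clean up the temporary chunk file
end
```

In the processor, the block body would be the conn.bulk_load call from
the patch.
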
_______________________________________________
Activewarehouse-discuss mailing list
Activewarehouse-discuss@rubyforge.org
http://rubyforge.org/mailman/listinfo/activewarehouse-discuss
