Hey Thomas,
I also wanted to share the use case where we are planning to use H2.
CSV files will be continuously generated by various processes.
The goal is to generate KPI statistics at an hourly, daily, weekly,
monthly and yearly level using various columns in these CSV files.
The initial basic design we are considering is as follows:
1. Keep polling for CSV files in a set of configured directories.
2. For each file do the following -
   a. Create an H2 table with the same name as the file name prefix,
with columns as per an already existing template table, using something
like:

      create table <FILENAME> as select * from TEMPLATETABLE where 1=2
   b. Load the CSV contents into the newly created table using
something like:

      insert into <FILENAME> (select * from CSVREAD('?', '<columnNameString>'))

      OR

      create table <FILENAME> as (select * from CSVREAD('?', '<columnNameString>'))
   c. Pass this <FILENAME> to multiple daemon threads. Each thread has
one associated H2 table, which will be an hourly aggregate table. Hence
multiple threads will be reading from the same <FILENAME> table and
computing different KPI stats into their own hourly aggregate tables,
using something like:

      merge into <AGGREGATETABLE1_HH>
        (select FORMATDATETIME(DATECOL, 'HH'), COL1, COL2, COL3,
                sum(COL4 - COL5), count(*), sum(COL6)
         from <FILENAME> where COL7 = 1
         group by FORMATDATETIME(DATECOL, 'HH'), COL1, COL2, COL3)
3. After each clock hour, another daemon thread will read from all the
hourly aggregate tables and dump them into CSV files using something
like:

      call CSVWRITE('?', 'select * from AGGREGATETABLE1_HH')
4. Once the KPI stats have been written to the output CSV files, the
previous set of tables will be dropped:

      drop table <FILENAME>
      drop table <AGGREGATETABLE1_HH>
5. The hourly KPI stats in the CSV files will be used to generate
daily, weekly, monthly and yearly KPI stats.
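Putting steps 2 through 4 together, the per-file lifecycle would look
roughly like the sketch below. FILE_X, the COLn names, the KPI column
names, and the file paths are all made-up placeholders, and the merge
column/key lists are my assumption about the aggregate table's layout:

```sql
-- Per-file lifecycle sketch; all identifiers and paths are placeholders.

-- 2a. Clone the template table's structure; WHERE 1=2 copies no rows.
create table FILE_X as
  select * from TEMPLATETABLE where 1=2;

-- 2b. Bulk-load the CSV contents into the new table.
insert into FILE_X
  (select * from CSVREAD('/data/in/FILE_X.csv',
                         'COL1,COL2,COL3,COL4,COL5,COL6,COL7,DATECOL'));

-- 2c. Aggregate into an hourly KPI table (one such statement per daemon
--     thread). KEY() names the grouping columns, so re-running the
--     merge updates existing rows instead of duplicating them.
merge into AGGREGATETABLE1_HH (HH, COL1, COL2, COL3, KPI1, KPI2, KPI3)
  key (HH, COL1, COL2, COL3)
  select FORMATDATETIME(DATECOL, 'HH'), COL1, COL2, COL3,
         sum(COL4 - COL5), count(*), sum(COL6)
  from FILE_X
  where COL7 = 1
  group by FORMATDATETIME(DATECOL, 'HH'), COL1, COL2, COL3;

-- 3. After the clock hour, dump each hourly aggregate table.
call CSVWRITE('/data/out/AGGREGATETABLE1_HH.csv',
              'select * from AGGREGATETABLE1_HH');

-- 4. Drop the per-file and per-hour tables.
drop table FILE_X;
drop table AGGREGATETABLE1_HH;
```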
The input CSV files will arrive at the following rate:
One 5 MB CSV file with approximately 30K records will be available for
processing roughly every 3 seconds (10K records/sec).
This means that steps 1 and 2 above need to finish within 3 seconds for
every file.
In other words, every second, 10K records must be read from a CSV file,
inserted into a temporary table, and processed from that temporary
table into the various aggregate tables.
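As a quick sanity check of the stated arithmetic (a throwaway query,
runnable against any SQL engine):

```sql
-- 30,000 records per file / 3 seconds per file = 10,000 records/sec
select 30000 / 3 as records_per_second;  -- 10000
```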
This can be achieved either by:
a. Running just 1 in-memory H2 instance for all files and KPIs.
b. Running multiple in-memory H2 instances each handling a distinct
set of input files and KPIs.
A few questions and comments:
1. Is the overall design feasible using H2, especially considering that
this is supposed to be an always-on system with 99.99% availability?
2. Tables will be continuously created and dropped, at a rate of at
least 10 tables every 3 seconds (1 table per input file). Would that
create any issues?
3. Data will be continuously read from CSV files and written back to
CSV files. I assume the buffer size used while reading and writing
should be configurable, so as to speed up CSV I/O.
4. The file names in both CSVREAD and CSVWRITE need to be dynamic so
that they can be supplied at execution time, as with prepared statement
parameters.
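For question 4, until execution-time prepare is available for
CSVREAD('?'), one workaround is for the application to substitute the
file name into the SQL text itself and execute it as a plain
(non-prepared) statement. The path and table name below are made-up
examples:

```sql
-- The file name must appear literally in the SQL text, so the
-- application builds this string per file before executing it:
insert into FILE_123
  (select * from CSVREAD('/data/in/file_123.csv', '<columnNameString>'));
```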
Any suggestions and ideas are welcome.
~Vipul.
On Jan 19, 11:40 am, vips <[email protected]> wrote:
> Hi Thomas,
>
> Yes, I do know the column names.
> The feature request you mentioned is exactly what is needed.
>
> Any idea when it will be available?
> I can do some of the changes too.
>
> Thanks,
> Vipul.
>
> On Jan 17, 8:48 pm, Thomas Mueller <[email protected]>
> wrote:
>
> > Hi,
>
> > The problem is that the parser needs to know the column names at
> > compile time of the query, otherwise it can't validate it. For
> > example: select x from csvread(?) - the compiler can't possibly know
> > if the column x exists.
>
> > If you know the column names, you could pass them as the second
> > parameter to csvread. However, currently the file name is required.
> > There is a feature request for "Allow execution time prepare for
> > SELECT * FROM CSVREAD(?, 'columnNameString')", I will increase the
> > priority. Please tell me if this feature would be enough for your use
> > case (do you know the column names in advance?).
>
> > Regards,
> > Thomas
--
You received this message because you are subscribed to the Google Groups "H2
Database" group.
To post to this group, send email to [email protected].
To unsubscribe from this group, send email to
[email protected].
For more options, visit this group at
http://groups.google.com/group/h2-database?hl=en.