Hi Nitin,

I think that before getting into details, you need to look into how to efficiently
read and write data from CSV files into HDF5 in Python.  For this, pandas is a
great library to use.  My advice is to have a look at the excellent
documentation on the pandas website:


http://pandas.pydata.org/pandas-docs/stable/io.html


In particular, you want to use `pandas.read_csv()`, which is one of the fastest
ways to read CSV files that I am aware of.  Also, for storing the data in HDF5,
`pandas.HDFStore()` comes in handy because it can generate HDF5 files out of
pandas DataFrames.  In addition, in order to avoid loading all the data into a
DataFrame in memory, you want to use the `chunksize` keyword, which lets you
read the CSV file in chunks before storing them.
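

In a nutshell, the pattern looks like this (a quick sketch only; the file and
table names below are just placeholders):

import pandas as pd

# Read the CSV in chunks of 10,000 rows and append each chunk to the same
# table in an HDF5 file, so only one chunk is in memory at a time
with pd.HDFStore("data.h5", mode="w", complevel=9, complib="blosc") as store:
    for chunk in pd.read_csv("data.csv", chunksize=10000):
        store.append("table", chunk, index=False)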


I have prepared a complete example for you (attached) so that you can have a
look at how to use all of this (it is simpler than it may seem).  Here is the
output on my machine:


$ python csv_demo.py
CSV creation time: 1.491 (67.092 Krow/s)
CSV reading time: 0.134 (748.360 Krow/s)
HDF5 store time: 0.322 (310.228 Krow/s)
HDF5 read time: 0.006 (15622.990 Krow/s)


So, once the data is stored in HDF5, the read times are much faster than with
CSV (as expected).
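

If you later want to get the data back into pandas from your own scripts,
`pandas.read_hdf()` is a handy one-liner (again just a sketch, assuming the
same 'table' key used in the attached script):

import pandas as pd

# Load the full table back into a DataFrame
df = pd.read_hdf("demo.h5", "table")

# Since the data is stored in table format, you can also select a subset
# without reading everything, e.g. the first 1000 rows by index
subset = pd.read_hdf("demo.h5", "table", where="index < 1000")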


HTH,


Francesc
# csv_demo.py
from time import time
import os
import numpy as np
import pandas as pd


FNAME = "demo.csv"
H5NAME = "demo.h5"
NROWS = 1000 * 100
NCOLS = 10


def create_csv(fname):
    """Create a CSV fname with NROWS x NCOLS filled with random data."""
    t0 = time()
    with open("demo.csv", "wt") as f:
        for nrow in range(NROWS):
            if nrow == 0:
                f.write(",".join(str(i) for i in range(NCOLS)) + '\n')
            row = np.random.randint(NROWS, NROWS + 100, NCOLS)
            f.write(",".join("%s" % d for d in row) + '\n')
    t = time() - t0
    print("CSV creation time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))

# Read all rows
def read_rows(fname):
    t0 = time()
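    # Iterate over the CSV in 1,000-row chunks so that only one chunk is
    # held in memory at a time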
    reader = pd.read_csv(fname, chunksize=1000)
    for n, chunk in enumerate(reader):
        pass
    t = time() - t0
    print("CSV reading time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))

def store_hdf5(fname, h5name):
    t0 = time()
    reader = pd.read_csv(fname, chunksize=1000)
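    # Blosc compression (complevel/complib) keeps the HDF5 file small;
    # index=False skips building a PyTables index on every chunk append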
    with pd.HDFStore(h5name, mode="w", complevel=9, complib='blosc') as store:
        for n, chunk in enumerate(reader):
            store.append('table', chunk, index=False)
    t = time() - t0
    print("HDF5 store time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))

def read_hdf5(h5name):
    t0 = time()
    with pd.HDFStore(h5name) as store:
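        # Load the whole table back into a DataFrame in one go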
        df = store['table']
    t = time() - t0
    print("HDF5 read time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))


if __name__ == "__main__":
    if not os.path.exists(FNAME):
        create_csv(FNAME)
    read_rows(FNAME)
    store_hdf5(FNAME, H5NAME)
    read_hdf5(H5NAME)