Hi Nitin,
I think that before getting into details, you should look into how to efficiently read and write data from CSV files into HDF5 in Python. For this, pandas is a great library to use. My advice is to have a look at the excellent documentation on the pandas website: http://pandas.pydata.org/pandas-docs/stable/io.html

In particular, you want to use `pandas.read_csv()`, which is one of the fastest ways to read CSV files that I am aware of. Also, for storing the data in HDF5, `pandas.HDFStore()` comes in handy because it can generate HDF5 files out of pandas DataFrames. In addition, in order to avoid loading all the data into a DataFrame in memory, you want to use the `chunksize` keyword, which allows you to read the CSV file in chunks before storing them.

I have prepared an example for you (attached) so that you can have a look at how to use all of this (it is simpler than it may seem). Here is the output on my machine:

$ python csv_demo.py
CSV creation time: 1.491 (67.092 Krow/s)
CSV reading time: 0.134 (748.360 Krow/s)
HDF5 store time: 0.322 (310.228 Krow/s)
HDF5 read time: 0.006 (15622.990 Krow/s)

So, once the data is stored in HDF5, the read times are much faster than with CSV (as expected).

HTH,
Francesc
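For reference, the core chunked read-and-append pattern boils down to just a few lines. This is a minimal, self-contained sketch (the file names and column labels here are made up for illustration; the attached script does the same thing with timings):

```python
import numpy as np
import pandas as pd

# Hypothetical file names, just for this sketch
csv_name = "data.csv"
h5_name = "data.h5"

# Create a small CSV so the example is self-contained
pd.DataFrame(np.random.randint(0, 100, (10000, 4)),
             columns=list("abcd")).to_csv(csv_name, index=False)

# Read the CSV in chunks and append each chunk to an HDF5 table,
# so the full dataset never has to fit in memory at once
with pd.HDFStore(h5_name, mode="w", complevel=9, complib="blosc") as store:
    for chunk in pd.read_csv(csv_name, chunksize=1000):
        store.append("table", chunk, index=False)

# Later, read the whole table back in one shot
df = pd.read_hdf(h5_name, "table")
print(df.shape)  # (10000, 4)
```

Note that `store.append()` writes to a "table" format store, which is what makes incremental appends possible (the default "fixed" format does not support them).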
from time import time
import os

import numpy as np
import pandas as pd

FNAME = "demo.csv"
H5NAME = "demo.h5"
NROWS = 1000 * 100
NCOLS = 10


def create_csv(fname):
    """Create a CSV `fname` with NROWS x NCOLS filled with random data."""
    t0 = time()
    with open(fname, "wt") as f:
        for nrow in range(NROWS):
            if nrow == 0:
                # Write the header line first
                f.write(",".join(str(i) for i in range(NCOLS)) + '\n')
            row = np.random.randint(NROWS, NROWS + 100, NCOLS)
            f.write(",".join("%s" % d for d in row) + '\n')
    t = time() - t0
    print("CSV creation time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))


def read_rows(fname):
    """Read all the rows in the CSV file, in chunks of 1000 rows."""
    t0 = time()
    reader = pd.read_csv(fname, chunksize=1000)
    for chunk in reader:
        pass
    t = time() - t0
    print("CSV reading time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))


def store_hdf5(fname, h5name):
    """Store the CSV contents into an HDF5 file, chunk by chunk."""
    t0 = time()
    reader = pd.read_csv(fname, chunksize=1000)
    with pd.HDFStore(h5name, mode="w", complevel=9, complib='blosc') as store:
        for chunk in reader:
            store.append('table', chunk, index=False)
    t = time() - t0
    print("HDF5 store time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))


def read_hdf5(h5name):
    """Read the whole table back from the HDF5 file."""
    t0 = time()
    with pd.HDFStore(h5name) as store:
        df = store['table']
    t = time() - t0
    print("HDF5 read time: %.3f (%.3f Krow/s)" % (t, NROWS / (1000 * t)))


if __name__ == "__main__":
    if not os.path.exists(FNAME):
        create_csv(FNAME)
    read_rows(FNAME)
    store_hdf5(FNAME, H5NAME)
    read_hdf5(H5NAME)
_______________________________________________
Hdf-forum is for HDF software users discussion.
Hdf-forum@lists.hdfgroup.org
http://lists.hdfgroup.org/mailman/listinfo/hdf-forum_lists.hdfgroup.org
Twitter: https://twitter.com/hdf5