Hi Peter,
My question now is more about dynamically adding data to a dataset, when I don't know the number of entries in the table I wish to store in the hdf file. The compression issue is no longer relevant, because the values we needed to store from the lines in the file were all integers. I'm struggling to get this part of the code working for me
...
/* Need to extend the data set */
H5.H5Dextend(did_SNP, DIMS);
int extended_dataspace_id = H5.H5Dget_space(did_SNP);
H5.H5Sselect_all(extended_dataspace_id);
H5.H5Sselect_hyperslab(extended_dataspace_id,
        HDF5Constants.H5S_SELECT_SET,
        new long[] { start_idx }, null, count, null);
H5.H5Dwrite(did_SNP, type_int_id, msid, extended_dataspace_id,
        HDF5Constants.H5P_DEFAULT, currentSNPIdArray);
cheers, Håkon
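A minimal sketch of the extend-then-write pattern from the H5Ex_D_UnlimitedAdd example linked below [1], assuming the loop variables from the snippet above. The likely problem is that H5Dextend takes the dataset's new TOTAL size, not an increment, so extending with the constant DIMS = { 2 } leaves the dataset at two elements and the hyperslab at start_idx = 2 falls outside it (the H5Sselect_all call is also redundant, since H5S_SELECT_SET replaces the selection). The running total below is an assumption about the intent, not code from the thread:
=======================
/* Sketch: extend to the new TOTAL size before selecting and writing.
 * start_idx already equals the number of elements written so far. */
long[] newTotalSize = { start_idx + BLOCK_SIZE }; /* hypothetical running total */
H5.H5Dextend(did_SNP, newTotalSize);
/* re-fetch the file dataspace so it reflects the new extent */
int extended_dataspace_id = H5.H5Dget_space(did_SNP);
H5.H5Sselect_hyperslab(extended_dataspace_id,
        HDF5Constants.H5S_SELECT_SET,
        new long[] { start_idx }, null, count, null);
H5.H5Dwrite(did_SNP, type_int_id, msid, extended_dataspace_id,
        HDF5Constants.H5P_DEFAULT, currentSNPIdArray);
H5.H5Sclose(extended_dataspace_id); /* close the per-iteration dataspace */
=======================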
On 12 April 2010 18:46, Peter Cao <[email protected]> wrote:
Hi Håkon,
HDF5 does not compress variable-length data well (you are basically compressing the addresses of the pointers to the variable-length data, not the data itself). For best performance, you should not use compression for variable-length strings.
Thanks
--pc
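For reference, a minimal sketch of the fixed-length alternative Peter mentions elsewhere in this thread, in case compression is still wanted: deflate then operates on the actual string bytes rather than on pointer addresses. The 32-byte width, file name, dataset name and sizes are illustrative assumptions, not values from the thread:
=======================
import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;

public class CreateFixedStrings {
    public static void main(String[] args) throws Exception {
        int fid = H5.H5Fcreate("fixed_strings.h5", HDF5Constants.H5F_ACC_TRUNC,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);
        /* fixed-length string type: 32 bytes per element (assumed width) */
        int tid = H5.H5Tcopy(HDF5Constants.H5T_C_S1);
        H5.H5Tset_size(tid, 32);
        int sid = H5.H5Screate_simple(1, new long[] { 1000 }, null);
        int plist = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
        H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
        H5.H5Pset_chunk(plist, 1, new long[] { 250 });
        H5.H5Pset_deflate(plist, 6); /* deflate now sees the real data bytes */
        int did = H5.H5Dcreate(fid, "/fixed_strs", tid, sid, plist);
        H5.H5Dclose(did);
        H5.H5Pclose(plist);
        H5.H5Sclose(sid);
        H5.H5Tclose(tid);
        H5.H5Fclose(fid);
    }
}
=======================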
Håkon Sagehaug wrote:
Hi again,
In my current application, which you helped with, I know how many lines I want to write beforehand, but I also wanted to get the use case working where I don't know this in advance. So I tried to modify the program you more or less wrote for me. My test file contains many lines, each with one integer on it, like this:
31643
36594
59354
2481
64079
64181
491566836
In the test program below, I've set the initial size to 2 for the segments to write to the dataset. After the first segment is written I need to extend the dataset, select the portion to write using a hyperslab, and then write it, but I'm having some problems. I tried to follow the example here [1], but have not succeeded. I've pasted the code below.
import java.io.File;
import java.util.Collection;

import org.apache.commons.io.FileUtils;

import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;
import ncsa.hdf.hdf5lib.exceptions.HDF5Exception;

/* BigFile is the author's own line-iterating file reader (not shown). */
public class HDFExtendLDData {

    private final static String H5_FILE = "/scratchtestHap/strings.h5";
    private final static String DNAME_SNP = "/snp.id.one";
    private final static int RANK = 1;
    private final static long[] MAX_DIMS = { HDF5Constants.H5S_UNLIMITED };

    /***
     * Creates a dataset for holding values of type integer, with a given
     * dimension, chunking and a group name.
     *
     * @param fid
     * @param dims
     * @param chunkSize
     * @param groupName
     * @throws Exception
     */
    private void createIntegerDataset(int fid, long[] dims, long[] chunkSize,
            String groupName) throws Exception {
        int did_snp = -1, type_int_id = -1, sid = -1, plist = -1, group_id = -1;
        try {
            type_int_id = H5.H5Tcopy(HDF5Constants.H5T_STD_I32LE);
            sid = H5.H5Screate_simple(RANK, dims, MAX_DIMS);
            plist = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
            H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
            H5.H5Pset_chunk(plist, RANK, chunkSize);
            H5.H5Pset_deflate(plist, 6);
            group_id = H5.H5Gcreate(fid, groupName, HDF5Constants.H5P_DEFAULT);
            did_snp = H5.H5Dcreate(group_id, groupName + DNAME_SNP,
                    type_int_id, sid, plist);
            System.out.println("created for chr " + groupName);
        } finally {
            try { H5.H5Pclose(plist); } catch (HDF5Exception ex) {}
            try { H5.H5Sclose(sid); } catch (HDF5Exception ex) {}
            try { H5.H5Dclose(did_snp); H5.H5Gclose(group_id); } catch (HDF5Exception ex) {}
        }
    }

    /***
     * Input a directory path that contains some files. Need to extract the
     * data from the files and create one data set for each file within one
     * hdf file.
     *
     * @param fid
     *            Hdf file id
     * @param sourceFolder
     *            Path to the folder
     * @throws Exception
     */
    private void writeDataFromFileToInt(int fid, String sourceFolder)
            throws Exception {
        int did_SNP = -1, msid = -1, fsid = -1, timesWritten = 0, group_id = -1;
        Collection<File> fileCollection = FileUtils.listFiles(new File(
                sourceFolder), new String[] { "txt" }, false);
        int filesAdded = 0;
        /* Loop through a directory of files. */
        for (File sourceFile : fileCollection) {
            try {
                String choromosome_tmp = sourceFile.getName().split("_")[1];
                String choromosome = choromosome_tmp.substring(3,
                        choromosome_tmp.length());
                choromosome = "/" + choromosome + "/";
                /* Setting the initial size of the data set */
                long[] DIMS = { 2 };
                long[] CHUNK_SIZE = { 2 };
                int BLOCK_SIZE = 2;
                long[] count = { BLOCK_SIZE };
                /* Creates a new data set for the file to parse */
                createIntegerDataset(fid, DIMS, CHUNK_SIZE, choromosome);
                /* open the group that holds the data set */
                group_id = H5.H5Gopen(fid, choromosome);
                /* open the data set */
                did_SNP = H5.H5Dopen(group_id, choromosome + DNAME_SNP);
                /* fetches the data type, should be integer */
                int type_int_id = H5.H5Dget_type(did_SNP);
                fsid = H5.H5Dget_space(did_SNP);
                /* Memory space */
                msid = H5.H5Screate_simple(RANK, count, null);
                /* Array for storing the values */
                int[] currentSNPIdArray = new int[BLOCK_SIZE];
                /* File to read the values from */
                BigFile ldFile = new BigFile(sourceFile.getAbsolutePath());
                int idx = 0, block_indx = 0, start_idx = 0;
                System.out.println("Started to parse the file");
                int currentLine = 0;
                timesWritten = 0;
                /* Iterating over each line in the file */
                for (String ldLine : ldFile) {
                    currentSNPIdArray[idx] = Integer.valueOf(ldLine);
                    idx++;
                    if (idx == BLOCK_SIZE) {
                        idx = 0;
                        if (timesWritten == 0) {
                            /* Just write to the data set */
                            H5.H5Sselect_hyperslab(fsid,
                                    HDF5Constants.H5S_SELECT_SET,
                                    new long[] { start_idx }, null, count, null);
                            H5.H5Dwrite(did_SNP, type_int_id, msid, fsid,
                                    HDF5Constants.H5P_DEFAULT, currentSNPIdArray);
                        } else {
                            /* Need to extend the data set */
                            H5.H5Dextend(did_SNP, DIMS);
                            int extended_dataspace_id = H5.H5Dget_space(did_SNP);
                            H5.H5Sselect_all(extended_dataspace_id);
                            H5.H5Sselect_hyperslab(extended_dataspace_id,
                                    HDF5Constants.H5S_SELECT_SET,
                                    new long[] { start_idx }, null, count, null);
                            H5.H5Dwrite(did_SNP, type_int_id, msid,
                                    extended_dataspace_id,
                                    HDF5Constants.H5P_DEFAULT, currentSNPIdArray);
                        }
                        block_indx++;
                        start_idx = currentLine + 1;
                        timesWritten++;
                    }
                    currentLine++;
                }
                filesAdded++;
                System.out.println("Finished parsing the file ");
            } finally {
                try { H5.H5Gclose(group_id); H5.H5Sclose(fsid); } catch (HDF5Exception ex) {}
                try { H5.H5Sclose(msid); } catch (HDF5Exception ex) {}
                try { H5.H5Dclose(did_SNP); } catch (HDF5Exception ex) {}
            }
        }
    }

    public void createFile(String sourceFile) throws Exception {
        int fid = -1;
        fid = H5.H5Fcreate(H5_FILE, HDF5Constants.H5F_ACC_TRUNC,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);
        if (fid < 0)
            return;
        try {
            writeDataFromFileToInt(fid, sourceFile);
        } finally {
            H5.H5Fclose(fid);
        }
    }
}
When running the code I get this error:
Exception in thread "main" ncsa.hdf.hdf5lib.exceptions.HDF5LibraryException
    at ncsa.hdf.hdf5lib.H5.H5Dwrite_int(Native Method)
    at ncsa.hdf.hdf5lib.H5.H5Dwrite(H5.java:1139)
    at ncsa.hdf.hdf5lib.H5.H5Dwrite(H5.java:1181)
    at no.uib.bccs.esysbio.sample.clients.HDFExtendLDData.writeDataFromFileToInt(HDFExtendLDData.java:145)
The line number in my code corresponds to where I'm writing to the dataset after I've extended it.
Any tips on how to solve the issue?
cheers, Håkon
[1]
http://www.hdfgroup.org/ftp/HDF5/examples/examples-by-api/java/examples/datasets/H5Ex_D_UnlimitedAdd.java
On 25 March 2010 16:13, Peter Cao <[email protected]> wrote:
For compression, block size does not matter; chunk size does. Larger chunk sizes usually compress better. We typically use 64KB to 1MB chunks for better performance. Try different chunk sizes, block sizes, and compression methods and levels to find the best I/O performance and compression ratio. As I mentioned earlier, if the content is random, the compression will not help much.
Thanks
--pc
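To make the numbers concrete: with 4-byte integers, a 64KB chunk holds 64 * 1024 / 4 = 16384 elements. A minimal sketch of a creation property list along those lines; the element count and deflate level are just this rule of thumb applied, not tuned values:
=======================
/* 64 KB chunks of 4-byte integers: 65536 / 4 = 16384 elements per chunk */
int plist = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
H5.H5Pset_chunk(plist, 1, new long[] { 16384 });
H5.H5Pset_deflate(plist, 6); /* try different levels, compare size vs. speed */
=======================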
Håkon Sagehaug wrote:
Hi
Yes, the content is more or less a random set of characters. I'll try some combinations and see what works best. We need to transfer the file over a network, so that's why we need to compress as much as possible. Will the block size/chunk size have any effect?
cheers, Håkon
On 25 March 2010 15:48, Peter Cao <[email protected]> wrote:
Hi Håkon,
I don't need the code; as long as it works for you, I am happy.
Deflate level 6 is a good compromise between file size and performance. The compression ratio depends on the content: if every string is like a random set of characters, the compression will not help much. I will leave it to you to try different compression options. If compression does not help much, it is much better not to use compression at all. It's your call.
Thanks
--pc
Håkon Sagehaug wrote:
Hi Peter
Thanks for all the help so far. I've added code to write the last elements; if you want it, I can paste it in a new email to you. One more question: we need to compress the data. I've now tried it like this, within createDataset(...):
H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
H5.H5Pset_chunk(plist, RANK, chunkSize);
H5.H5Pset_deflate(plist, 9);
I'm not sure what the most efficient way is. I tried to exchange the H5Pset_deflate(plist, 9) with
H5.H5Pset_szip(plist, HDF5Constants.H5_SZIP_NN_OPTION_MASK, 8);
but did not see any difference. I read that szip might be better. Without deflate the hdf file is 1.5 GB; with deflate it's 1.3 GB. So my hope is that it can be decreased further in size.
cheers, Håkon
On 24 March 2010 17:25, Peter Cao <[email protected]> wrote:
Hi Håkon,
Glad to know it works for you. You also need to take care of the case where the last block does not have size BLOCK_SIZE. This will happen if the total size (25M) is not divisible by BLOCK_SIZE. For better performance, make sure that BLOCK_SIZE is divisible by CHUNK_SIZE.
Thanks
--pc
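A minimal sketch of what flushing that trailing partial block could look like, placed after the main loop of the program further down; tail_msid and tailStrs are hypothetical names, while idx, strs, start_idx, fsid, did and tid come from that program:
=======================
/* after the loop: idx leftover strings remain in strs (0 < idx < BLOCK_SIZE) */
if (idx > 0) {
    long[] tail = { idx };
    int tail_msid = H5.H5Screate_simple(1, tail, null); /* memory space of leftover size */
    H5.H5Sselect_hyperslab(fsid, HDF5Constants.H5S_SELECT_SET,
            new long[] { start_idx }, null, tail, null);
    String[] tailStrs = new String[idx];
    System.arraycopy(strs, 0, tailStrs, 0, idx); /* copy only the filled entries */
    H5.H5Dwrite(did, tid, tail_msid, fsid, HDF5Constants.H5P_DEFAULT, tailStrs);
    H5.H5Sclose(tail_msid);
}
=======================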
Håkon Sagehaug wrote:
Hi Peter,
Thanks so much for the code; it seems to work very well. The only thing I found was that the index for the next write into the hdf array needed 1 added to it, so instead of
start_idx = i;
I now have
start_idx = i + 1;
cheers, Håkon
On 24 March 2010 01:19, Peter Cao <[email protected]> wrote:
Hi Håkon,
Below is a program that you can start with. I am using variable-length strings. For fixed-length strings there is some extra work: you may have to make the strings all the same length. You may try different chunk sizes and block sizes to get the best performance.
=======================
import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;
import ncsa.hdf.hdf5lib.exceptions.HDF5Exception;

public class CreateStrings {
    private final static String H5_FILE = "G:\\temp\\strings.h5";
    private final static String DNAME = "/strs";
    private final static int RANK = 1;
    private final static long[] DIMS = { 25000000 };
    private final static long[] MAX_DIMS = { HDF5Constants.H5S_UNLIMITED };
    private final static long[] CHUNK_SIZE = { 25000 };
    private final static int BLOCK_SIZE = 250000;

    private void createDataset(int fid) throws Exception {
        int did = -1, tid = -1, sid = -1, plist = -1;
        try {
            tid = H5.H5Tcopy(HDF5Constants.H5T_C_S1);
            // use variable length to save space
            H5.H5Tset_size(tid, HDF5Constants.H5T_VARIABLE);
            sid = H5.H5Screate_simple(RANK, DIMS, MAX_DIMS);
            // figure out creation properties
            plist = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
            H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
            H5.H5Pset_chunk(plist, RANK, CHUNK_SIZE);
            did = H5.H5Dcreate(fid, DNAME, tid, sid, plist);
        } finally {
            try { H5.H5Pclose(plist); } catch (HDF5Exception ex) {}
            try { H5.H5Sclose(sid); } catch (HDF5Exception ex) {}
            try { H5.H5Dclose(did); } catch (HDF5Exception ex) {}
        }
    }

    private void writeData(int fid) throws Exception {
        int did = -1, tid = -1, msid = -1, fsid = -1;
        long[] count = { BLOCK_SIZE };
        try {
            did = H5.H5Dopen(fid, DNAME);
            tid = H5.H5Dget_type(did);
            fsid = H5.H5Dget_space(did);
            msid = H5.H5Screate_simple(RANK, count, null);
            String[] strs = new String[BLOCK_SIZE];
            int idx = 0, block_indx = 0, start_idx = 0;
            long t0 = 0, t1 = 0;
            t0 = System.currentTimeMillis();
            System.out.println("Total number of blocks = " + (DIMS[0] / BLOCK_SIZE));
            for (int i = 0; i < DIMS[0]; i++) {
                strs[idx++] = "str" + i;
                if (idx == BLOCK_SIZE) { // operator % is very expensive
                    idx = 0;
                    H5.H5Sselect_hyperslab(fsid, HDF5Constants.H5S_SELECT_SET,
                            new long[] { start_idx }, null, count, null);
                    H5.H5Dwrite(did, tid, msid, fsid,
                            HDF5Constants.H5P_DEFAULT, strs);
                    if (block_indx == 10) {
                        t1 = System.currentTimeMillis();
                        System.out.println("Total time (minutes) = "
                                + ((t1 - t0) * (DIMS[0] / BLOCK_SIZE)) / 1000 / 600);
                    }
                    block_indx++;
                    start_idx = i;
                }
            }
        } finally {
            try { H5.H5Sclose(fsid); } catch (HDF5Exception ex) {}
            try { H5.H5Sclose(msid); } catch (HDF5Exception ex) {}
            try { H5.H5Dclose(did); } catch (HDF5Exception ex) {}
        }
    }

    private void createFile() throws Exception {
        int fid = -1;
        fid = H5.H5Fcreate(H5_FILE, HDF5Constants.H5F_ACC_TRUNC,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);
        if (fid < 0)
            return;
        try {
            createDataset(fid);
            writeData(fid);
        } finally {
            H5.H5Fclose(fid);
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            (new CreateStrings()).createFile();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
=========================
-- Håkon Sagehaug, Scientific Programmer
Parallab, Uni BCCS/Uni Research
[email protected], phone +47 55584125
_______________________________________________
Hdf-forum is for HDF software users discussion.
[email protected]
http://mail.hdfgroup.org/mailman/listinfo/hdf-forum_hdfgroup.org