Hi Håkon,

HDF5 does not compress variable-length data well (essentially you would be compressing the file addresses of the pointers to the variable-length data, not the data itself). For best performance, do not use compression for variable-length strings.
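If you do need the string data compressed, one workaround (just a sketch, assuming a maximum string length of 64 bytes) is to store the strings as fixed-length, so the characters themselves end up in the chunks where deflate can reach them:

    // Sketch: fixed-length string type; 64 is an assumed maximum length.
    int tid = H5.H5Tcopy(HDF5Constants.H5T_C_S1);
    H5.H5Tset_size(tid, 64); // fixed size instead of H5T_VARIABLE
    H5.H5Tset_strpad(tid, HDF5Constants.H5T_STR_NULLPAD); // null-pad shorter strings

Shorter strings are padded out, so this trades some raw size for compressibility.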

Thanks
--pc


Håkon Sagehaug wrote:
Hi again,

In my current application, which you helped with, I know beforehand how many lines I want to write, but I also wanted to get the use case working where I don't know this in advance. So I tried to modify the program you more or less wrote for me. My test file contains many lines, each with one integer on it, like this:

31643
36594
59354
2481
64079
64181
491566836

In the test program below, I've set the initial size of the dataset to 2 and write it in segments of the same size. After the first segment is written I need to extend the dataset, select the portion to write to using a hyperslab, and then write to it, but I'm having some problems. I tried to follow the example here [1], but have not succeeded. I've pasted the code below.

import java.io.File;
import java.util.Collection;

import org.apache.commons.io.FileUtils;

import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;
import ncsa.hdf.hdf5lib.exceptions.HDF5Exception;

public class HDFExtendLDData {

    private final static String H5_FILE = "/scratchtestHap/strings.h5";
    private final static String DNAME_SNP = "/snp.id.one";
    private final static int RANK = 1;
    private final static long[] MAX_DIMS = { HDF5Constants.H5S_UNLIMITED };

    /**
     * Creates a dataset for holding values of type integer, with a given
     * dimension, chunking and a group name.
     *
     * @param fid
     * @param dims
     * @param chunkSize
     * @param groupName
     * @throws Exception
     */
    private void createIntegerDataset(int fid, long[] dims, long[] chunkSize,
            String groupName) throws Exception {
        int did_snp = -1, type_int_id = -1, sid = -1, plist = -1, group_id = -1;

        try {
            type_int_id = H5.H5Tcopy(HDF5Constants.H5T_STD_I32LE);
            sid = H5.H5Screate_simple(RANK, dims, MAX_DIMS);

            plist = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
            H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
            H5.H5Pset_chunk(plist, RANK, chunkSize);
            H5.H5Pset_deflate(plist, 6);

            group_id = H5.H5Gcreate(fid, groupName, HDF5Constants.H5P_DEFAULT);
            did_snp = H5.H5Dcreate(group_id, groupName + DNAME_SNP,
                    type_int_id, sid, plist);

            System.out.println("created for chr " + groupName);
        } finally {
            try {
                H5.H5Pclose(plist);
            } catch (HDF5Exception ex) {
            }
            try {
                H5.H5Sclose(sid);
            } catch (HDF5Exception ex) {
            }
            try {
                H5.H5Dclose(did_snp);
                H5.H5Gclose(group_id);
            } catch (HDF5Exception ex) {
            }
        }
    }

    /**
     * Input a directory path that contains some files. Extracts the data
     * from the files and creates one dataset for each file within one HDF
     * file.
     *
     * @param fid
     *            HDF file id
     * @param sourceFolder
     *            Path to the folder
     * @throws Exception
     */
    private void writeDataFromFileToInt(int fid, String sourceFolder)
            throws Exception {
        int did_SNP = -1, msid = -1, fsid = -1, timesWritten = 0, group_id = -1;

        Collection<File> fileCollection = FileUtils.listFiles(new File(
                sourceFolder), new String[] { "txt" }, false);

        int filesAdded = 0;

        /* Loop through a directory of files. */
        for (File sourceFile : fileCollection) {
            try {
                String chromosome_tmp = sourceFile.getName().split("_")[1];
                String chromosome = chromosome_tmp.substring(3,
                        chromosome_tmp.length());
                chromosome = "/" + chromosome + "/";

                /* Set the initial size of the dataset. */
                long[] DIMS = { 2 };
                long[] CHUNK_SIZE = { 2 };
                int BLOCK_SIZE = 2;

                long[] count = { BLOCK_SIZE };

                /* Create a new dataset for the file to parse. */
                createIntegerDataset(fid, DIMS, CHUNK_SIZE, chromosome);

                /* Open the group that holds the dataset. */
                group_id = H5.H5Gopen(fid, chromosome);

                /* Open the dataset. */
                did_SNP = H5.H5Dopen(group_id, chromosome + DNAME_SNP);

                /* Fetch the datatype; should be integer. */
                int type_int_id = H5.H5Dget_type(did_SNP);

                fsid = H5.H5Dget_space(did_SNP);

                /* Memory space */
                msid = H5.H5Screate_simple(RANK, count, null);

                /* Array for storing the values */
                int[] currentSNPIdArray = new int[BLOCK_SIZE];

                /* File to read the values from; BigFile is my own
                   line-by-line reader (Iterable<String>). */
                BigFile ldFile = new BigFile(sourceFile.getAbsolutePath());

                int idx = 0, block_indx = 0, start_idx = 0;
                System.out.println("Started to parse the file");

                int currentLine = 0;
                timesWritten = 0;

                /* Iterate over each line in the file. */
                for (String ldLine : ldFile) {

                    currentSNPIdArray[idx] = Integer.valueOf(ldLine);
                    idx++;

                    if (idx == BLOCK_SIZE) {
                        idx = 0;
                        if (timesWritten == 0) {
                            /* Just write to the dataset. */
                            H5.H5Sselect_hyperslab(fsid,
                                    HDF5Constants.H5S_SELECT_SET,
                                    new long[] { start_idx }, null, count,
                                    null);
                            H5.H5Dwrite(did_SNP, type_int_id, msid, fsid,
                                    HDF5Constants.H5P_DEFAULT,
                                    currentSNPIdArray);
                        } else {
                            /* Need to extend the dataset. */
                            H5.H5Dextend(did_SNP, DIMS);
                            int extended_dataspace_id = H5
                                    .H5Dget_space(did_SNP);
                            H5.H5Sselect_all(extended_dataspace_id);
                            H5.H5Sselect_hyperslab(extended_dataspace_id,
                                    HDF5Constants.H5S_SELECT_SET,
                                    new long[] { start_idx }, null, count,
                                    null);
                            H5.H5Dwrite(did_SNP, type_int_id, msid,
                                    extended_dataspace_id,
                                    HDF5Constants.H5P_DEFAULT,
                                    currentSNPIdArray);
                        }

                        block_indx++;
                        start_idx = currentLine + 1;
                        timesWritten++;
                    }

                    currentLine++;
                }
                filesAdded++;

                System.out.println("Finished parsing the file");

            } finally {
                try {
                    H5.H5Gclose(group_id);
                    H5.H5Sclose(fsid);
                } catch (HDF5Exception ex) {
                }
                try {
                    H5.H5Sclose(msid);
                } catch (HDF5Exception ex) {
                }
                try {
                    H5.H5Dclose(did_SNP);
                } catch (HDF5Exception ex) {
                }
            }
        }
    }

    public void createFile(String sourceFile) throws Exception {
        int fid = -1;

        fid = H5.H5Fcreate(H5_FILE, HDF5Constants.H5F_ACC_TRUNC,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);

        if (fid < 0)
            return;

        try {
            writeDataFromFileToInt(fid, sourceFile);
        } finally {
            H5.H5Fclose(fid);
        }
    }
}

When I run the code I get this error:

Exception in thread "main" ncsa.hdf.hdf5lib.exceptions.HDF5LibraryException
    at ncsa.hdf.hdf5lib.H5.H5Dwrite_int(Native Method)
    at ncsa.hdf.hdf5lib.H5.H5Dwrite(H5.java:1139)
    at ncsa.hdf.hdf5lib.H5.H5Dwrite(H5.java:1181)
    at no.uib.bccs.esysbio.sample.clients.HDFExtendLDData.writeDataFromFileToInt(HDFExtendLDData.java:145)

The line number in my code corresponds to where I'm writing to the dataset after I've extended it.
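For comparison, my reading of the extend-then-write pattern in [1] is roughly the following (a sketch using the names from my program, not verified code); note that it passes the new total size to the extend call rather than the initial dims:

    // Sketch of the pattern in [1]: grow the dataset to the new TOTAL
    // size, then refresh the file dataspace and select the new region.
    long[] extdims = { start_idx + BLOCK_SIZE }; // new total element count
    H5.H5Dextend(did_SNP, extdims);
    int extended_sid = H5.H5Dget_space(did_SNP); // dataspace after extending
    H5.H5Sselect_hyperslab(extended_sid, HDF5Constants.H5S_SELECT_SET,
            new long[] { start_idx }, null, count, null);
    H5.H5Dwrite(did_SNP, type_int_id, msid, extended_sid,
            HDF5Constants.H5P_DEFAULT, currentSNPIdArray);
    H5.H5Sclose(extended_sid);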

Any tips on how to solve the issue?

cheers, Håkon

[1] http://www.hdfgroup.org/ftp/HDF5/examples/examples-by-api/java/examples/datasets/H5Ex_D_UnlimitedAdd.java


On 25 March 2010 16:13, Peter Cao <[email protected]> wrote:

    For compression, block size does not matter; chunk size does.
    Usually a larger chunk size tends to compress better. We usually use
    64KB to 1MB chunks for better performance. Try different chunk
    sizes, block sizes, and compression methods and levels to find the
    best I/O performance and compression ratio. As I mentioned earlier,
    if the content is random, compression will not help much.
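
    For example (a rough sketch assuming a 4-byte integer element type,
    with plist and RANK as in the code in this thread), a 64KB chunk
    works out to 16384 elements:

        // Sketch: 64KB chunks for a 1-D dataset of 4-byte integers.
        // 65536 bytes / 4 bytes per element = 16384 elements per chunk.
        long[] chunkSize = { 16384 };
        H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
        H5.H5Pset_chunk(plist, RANK, chunkSize);
        H5.H5Pset_deflate(plist, 6); // level 6: size/speed trade-off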


    Thanks
    --pc


    Håkon Sagehaug wrote:

        Hi

        Yes, the content is more or less a random set of characters.
        I'll try some combinations and see what works best. We need to
        transfer the file over a network, so that's why we need to
        compress as much as possible. Will the block size/chunk size
        make any difference?

        cheers, Håkon

        On 25 March 2010 15:48, Peter Cao <[email protected]> wrote:

           Hi Håkon,

           I don't need the code; as long as it works for you, I am happy.

           Deflate level 6 is a good compromise between file size and
           performance. The compression ratio depends on the content. If
           every string is like a random set of characters, compression
           will not help much. I will leave it to you to try different
           compression options. If compression does not help much, it is
           better not to use compression at all. It's your call.


           Thanks
           --pc



           Håkon Sagehaug wrote:



               Hi Peter

                Thanks for all the help so far. I've added code to write
                the last elements; if you want it, I can paste it in a new
                email to you. One more question: we need to compress the
                data. I've now tried this, within createDataset(...):


               H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
               H5.H5Pset_chunk(plist, RANK, chunkSize);
               H5.H5Pset_deflate(plist, 9);

                I'm not sure what the most efficient way is. I tried to
                exchange H5.H5Pset_deflate(plist, 9) with

                H5.H5Pset_szip(plist, HDF5Constants.H5_SZIP_NN_OPTION_MASK, 8);

                but did not see any difference. I read that szip might be
                better. If I don't use deflate the HDF file is 1.5 GB; with
                deflate it's 1.3 GB. So my hope is that it can be further
                decreased in size.

               cheers, Håkon

                On 24 March 2010 17:25, Peter Cao <[email protected]> wrote:

                  Hi Håkon,

                   Glad to know it works for you. You also need to take
                   care of the case where the last block does not have the
                   full BLOCK_SIZE. This will happen if the total size
                   (25M) is not divisible by BLOCK_SIZE. For better
                   performance, make sure that BLOCK_SIZE is divisible by
                   CHUNK_SIZE.
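
                   Something like the following after the main loop, for
                   instance (just a sketch, using the names from the
                   program below):

                       // Sketch: write the leftover partial block of idx
                       // elements once the main loop finishes.
                       if (idx > 0) {
                           long[] tail = { idx };
                           int tail_msid = H5.H5Screate_simple(RANK, tail, null);
                           String[] tailStrs = new String[idx];
                           System.arraycopy(strs, 0, tailStrs, 0, idx);
                           H5.H5Sselect_hyperslab(fsid,
                                   HDF5Constants.H5S_SELECT_SET,
                                   new long[] { start_idx }, null, tail, null);
                           H5.H5Dwrite(did, tid, tail_msid, fsid,
                                   HDF5Constants.H5P_DEFAULT, tailStrs);
                           H5.H5Sclose(tail_msid);
                       }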


                  Thanks
                  --pc


                  Håkon Sagehaug wrote:

                      Hi Peter,

                       Thanks so much for the code; it seems to work very
                       well. The only thing I found was that the index for
                       the next write position in the HDF array needed 1
                       added to it, so instead of

                          start_idx = i;

                       I now have

                          start_idx = i + 1;

                      cheers, Håkon




                       On 24 March 2010 01:19, Peter Cao <[email protected]> wrote:

                          Hi Håkon,

                          Below is a program that you can start with. I am
                          using variable-length strings. For fixed-length
                          strings there is some extra work: you may have to
                          make the strings the same length.

                          You may try different chunk sizes and block sizes
                          to get the best performance.

=======================
import ncsa.hdf.hdf5lib.H5;
import ncsa.hdf.hdf5lib.HDF5Constants;
import ncsa.hdf.hdf5lib.exceptions.HDF5Exception;

public class CreateStrings {

    private final static String H5_FILE = "G:\\temp\\strings.h5";
    private final static String DNAME = "/strs";
    private final static int RANK = 1;
    private final static long[] DIMS = { 25000000 };
    private final static long[] MAX_DIMS = { HDF5Constants.H5S_UNLIMITED };
    private final static long[] CHUNK_SIZE = { 25000 };
    private final static int BLOCK_SIZE = 250000;

    private void createDataset(int fid) throws Exception {
        int did = -1, tid = -1, sid = -1, plist = -1;

        try {
            tid = H5.H5Tcopy(HDF5Constants.H5T_C_S1);
            // use variable length to save space
            H5.H5Tset_size(tid, HDF5Constants.H5T_VARIABLE);
            sid = H5.H5Screate_simple(RANK, DIMS, MAX_DIMS);

            // figure out creation properties
            plist = H5.H5Pcreate(HDF5Constants.H5P_DATASET_CREATE);
            H5.H5Pset_layout(plist, HDF5Constants.H5D_CHUNKED);
            H5.H5Pset_chunk(plist, RANK, CHUNK_SIZE);

            did = H5.H5Dcreate(fid, DNAME, tid, sid, plist);
        } finally {
            try {
                H5.H5Pclose(plist);
            } catch (HDF5Exception ex) {
            }
            try {
                H5.H5Sclose(sid);
            } catch (HDF5Exception ex) {
            }
            try {
                H5.H5Dclose(did);
            } catch (HDF5Exception ex) {
            }
        }
    }

    private void writeData(int fid) throws Exception {
        int did = -1, tid = -1, msid = -1, fsid = -1;
        long[] count = { BLOCK_SIZE };

        try {
            did = H5.H5Dopen(fid, DNAME);
            tid = H5.H5Dget_type(did);
            fsid = H5.H5Dget_space(did);
            msid = H5.H5Screate_simple(RANK, count, null);
            String[] strs = new String[BLOCK_SIZE];

            int idx = 0, block_indx = 0, start_idx = 0;
            long t0 = 0, t1 = 0;
            t0 = System.currentTimeMillis();
            System.out.println("Total number of blocks = "
                    + (DIMS[0] / BLOCK_SIZE));
            for (int i = 0; i < DIMS[0]; i++) {
                strs[idx++] = "str" + i;
                if (idx == BLOCK_SIZE) { // operator % is very expensive
                    idx = 0;
                    H5.H5Sselect_hyperslab(fsid,
                            HDF5Constants.H5S_SELECT_SET,
                            new long[] { start_idx }, null, count, null);
                    H5.H5Dwrite(did, tid, msid, fsid,
                            HDF5Constants.H5P_DEFAULT, strs);

                    if (block_indx == 10) {
                        t1 = System.currentTimeMillis();
                        System.out.println("Total time (minutes) = "
                                + ((t1 - t0) * (DIMS[0] / BLOCK_SIZE)) / 1000
                                / 600);
                    }

                    block_indx++;
                    start_idx = i;
                }
            }
        } finally {
            try {
                H5.H5Sclose(fsid);
            } catch (HDF5Exception ex) {
            }
            try {
                H5.H5Sclose(msid);
            } catch (HDF5Exception ex) {
            }
            try {
                H5.H5Dclose(did);
            } catch (HDF5Exception ex) {
            }
        }
    }

    private void createFile() throws Exception {
        int fid = -1;

        fid = H5.H5Fcreate(H5_FILE, HDF5Constants.H5F_ACC_TRUNC,
                HDF5Constants.H5P_DEFAULT, HDF5Constants.H5P_DEFAULT);

        if (fid < 0)
            return;

        try {
            createDataset(fid);
            writeData(fid);
        } finally {
            H5.H5Fclose(fid);
        }
    }

    /**
     * @param args
     */
    public static void main(String[] args) {
        try {
            (new CreateStrings()).createFile();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
=========================











                -- 
                Håkon Sagehaug, Scientific Programmer
                Parallab, Uni BCCS/Uni Research
                [email protected], phone +47 55584125








