Re: [Pytables-users] Speed of in-kernel Full-Table Search
Hi Anthony and Antonio,

Thanks for your fast responses. It's great to hear that all features are now free to use, though it took me a week and a half to find that out. The first reference I read to learn PyTables was Hints for SQL Users [1], which states several times, for example in the section 'Creating an index': Indexing is supported in the commercial version of PyTables (PyTablesPro). I would suggest updating these texts. Having read this so often, I was convinced that indexing was only available in the Pro version, so I also overlooked the warning on the PyTables Pro page [2] (as I was only interested in the features not available in the free version, I scrolled down immediately, reading diagonally...). So my next suggestion is to give the warning text there a color :)

[1] http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex
    http://www.pytables.org/moin/HintsForSQLUsers#Selectingdata
[2] http://www.pytables.org/moin/PyTablesPro

regards,
Sebastian

> On Mon, Jun 24, 2013 at 4:25 AM, Wagner Sebastian sebastian.wagner...@ait.ac.at wrote:
>> Dear PyTables-Users,
>>
>> For testing purposes I use a PyTables DB with 4 columns (1x UInt8 and 3x Float) and 750k rows; the total file size is about 90 MB. As the free version does not support indexing, I thought that a (full-table) search on this database would take at least one or two seconds, because the file has to be loaded first (I/O bottleneck) and only then can the search over ~20k rows begin. But PyTables took only 0.05 seconds for a full-table search (in-kernel, so near C speed, but nevertheless full table), while my bisecting algorithm with a precomputed sorted list wrapped around PyTables (but saved in there) took about 0.5 seconds.
>>
>> So the thing I don't understand: how can PyTables be so fast without any indexing?
>
> Hi Sebastian,
>
> First, there is no longer a non-free version of PyTables and v3.0 *does* have indexing capabilities. However, you have to enable them, so you probably weren't using them. PyTables is fast because HDF5 is a binary format, it uses pthreads under the covers to parallelize some tasks, and it uses numexpr (which is also parallel) to evaluate many expressions. All of these things help make PyTables great!
>
> Be Well
> Anthony

> On 24/06/2013 11:25, Wagner Sebastian wrote:
>> I'm using 3.0.0rc2 coming with WinPython
>>
>> Regards,
>> Sebastian
>
> The indexing features of PyTables Pro have been available in the open source version of PyTables since version 2.3 (please see [1]).
>
> [1] http://pytables.github.io/release-notes/RELEASE_NOTES_v2.3.x.html#changes-from-2-2-1-to-2-3
>
> ciao
> --
> Antonio Valentino

--
This SF.net email is sponsored by Windows: Build for Windows Store.
http://p.sf.net/sfu/windows-dev2dev
___
Pytables-users mailing list
Pytables-users@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/pytables-users
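To make the two query modes discussed in this thread concrete, here is a minimal sketch of an in-kernel full-table search followed by the same search after an index has been created explicitly. The table layout, column names, row count and condition are assumptions chosen for illustration; they are not Sebastian's actual data.

```python
import os
import tempfile

import numpy as np
import tables


class Sample(tables.IsDescription):
    # Hypothetical layout loosely mirroring the thread: one UInt8 and three floats.
    flag = tables.UInt8Col(pos=0)
    x = tables.Float64Col(pos=1)
    y = tables.Float64Col(pos=2)
    z = tables.Float64Col(pos=3)


n = 10_000
rows = np.zeros(n, dtype=[("flag", "u1"), ("x", "f8"), ("y", "f8"), ("z", "f8")])
rows["flag"] = np.arange(n) % 4
rows["x"] = np.arange(n, dtype="f8")
rows["y"] = rows["x"] * 0.5
rows["z"] = -rows["x"]

condition = "(x < 100.0) & (y > 10.0)"

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "search_demo.h5")
    with tables.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "samples", Sample)
        table.append(rows)
        table.flush()

        # In-kernel full-table scan: the condition string is evaluated by
        # numexpr on chunks of the table, without a Python-level loop per row.
        hits_scan = [float(row["x"]) for row in table.where(condition)]

        # Indexing is opt-in: nothing is indexed until create_index() is called.
        table.cols.x.create_index()
        hits_indexed = table.read_where(condition)["x"].tolist()
```

Both queries select the same rows; the index only changes how PyTables locates them. On a table this small the difference in run time is unlikely to be measurable, so the sketch deliberately makes no timing claim.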
[Pytables-users] writing metadata
Dear PyTables users,

I am trying to figure out the best way to write some metadata into some files I have. The HDF5 file looks like

    /root/data_1/stat
    /root/data_1/sys

where stat and sys are Arrays containing statistical and systematic fluctuations of numerical fits to some data I have. What I would like to do is add another object

    /root/data_1/fit

where fit is just a metadata key that describes all the choices I made in performing the fit, such as the seed for the random number generator and the many fitting options, like initial guess values of parameters, fitting range, etc.

I began to follow the example in the PyTables manual, in Section 1.2 The Object Tree, where first a class is defined

    class Particle(tables.IsDescription):
        identity = tables.StringCol(itemsize=22, dflt=" ", pos=0)
        ...

and then this class is used to populate a table. In my case, I won't have a table, but really just want a single object containing my metadata. I am wondering if there is a recommended way to do this? The Table does not seem optimal, but I don't see what else I would use.

Thanks,
Andre
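For reference, the table-based route described in this message, reduced to a single row of fit settings, might look like the sketch below. The class name `FitSettings`, its columns and the stored values are hypothetical; the real fields depend on the fit being documented.

```python
import os
import tempfile

import tables


class FitSettings(tables.IsDescription):
    # Hypothetical metadata fields, one column per fitting choice.
    seed = tables.Int64Col(pos=0)
    fit_min = tables.Float64Col(pos=1)
    fit_max = tables.Float64Col(pos=2)
    method = tables.StringCol(itemsize=16, pos=3)


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "fit_table_demo.h5")
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "data_1")
        table = h5.create_table(group, "fit", FitSettings)

        # Fill and append exactly one row.
        row = table.row
        row["seed"] = 1234
        row["fit_min"] = 2.0
        row["fit_max"] = 12.5
        row["method"] = b"least_squares"
        row.append()
        table.flush()

        # Read the single row back as a NumPy record.
        stored = table[0]
        seed = int(stored["seed"])
        fit_max = float(stored["fit_max"])
        method = stored["method"].decode()
        n_rows = int(table.nrows)
```

A fixed-width string column truncates longer values silently, which is one reason a table can feel heavy for free-form metadata.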
Re: [Pytables-users] writing metadata
Another option is to create a Python object - dict, list, or whatever works - containing the metadata and then store a pickled version of it in a PyTables array. It's nice for this sort of thing because you have the full flexibility of Python's data containers.

For example, if the Python object is called 'fit', then numpy.frombuffer(pickle.dumps(fit), 'u1') will pickle it and convert the result to a NumPy array of unsigned bytes. It can be stored in a PyTables array using a UInt8Atom. To retrieve the Python object, just use pickle.loads(hdf5_file.root.data_1.fit[:]).

It gets a little more complicated if you want to be able to modify the Python object, because the length of the pickle will change. In that case, you can use an EArray (for the case when the pickle grows) and store the number of bytes as an attribute. Storing the number of bytes handles the case when the pickle shrinks and no longer uses the full length of the on-disk array. To load it, use pickle.loads(hdf5_file.root.data_1.fit[:num_bytes]), where num_bytes is the previously stored attribute. To modify it, just overwrite the array with the new version, expanding if necessary, then update the num_bytes attribute.

Using a PyTables VLArray with an 'object' atom uses a similar technique under the hood, so that may be easier. It doesn't allow resizing though.

Hope that helps,
Josh

On Tue, Jun 25, 2013 at 1:33 AM, Andreas Hilboll li...@hilboll.de wrote:
> On 25.06.2013 10:26, Andre' Walker-Loud wrote:
>> In my case, I won't have a table, but really just want a single object containing my metadata. I am wondering if there is a recommended way to do this? The Table does not seem optimal, but I don't see what else I would use.
>
> For complex information I'd probably indeed use a table object. It doesn't matter if the table only has one row, but still you have all the information there nicely structured.
>
> -- Andreas.
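The pickle-in-an-array approach from Josh's reply can be sketched roughly as follows. The helper names `to_uint8`, `save` and `load`, the file name and the contents of `fit` are assumptions made for illustration; only the overall technique (a UInt8 EArray plus a `num_bytes` attribute) comes from the message.

```python
import os
import pickle
import tempfile

import numpy as np
import tables


def to_uint8(obj):
    """Pickle obj and view the bytes as a NumPy array of unsigned bytes."""
    return np.frombuffer(pickle.dumps(obj), dtype="u1")


def save(store, obj):
    """Overwrite the stored pickle, growing the EArray when needed."""
    data = to_uint8(obj)
    old = int(store.nrows)
    if len(data) > old:
        if old:
            store[:old] = data[:old]      # reuse the space already on disk
        store.append(data[old:])          # then extend with the remainder
    else:
        store[:len(data)] = data          # shorter pickle: tail stays unused
    store.attrs.num_bytes = len(data)


def load(store):
    """Unpickle only the first num_bytes bytes of the array."""
    n = int(store.attrs.num_bytes)
    return pickle.loads(store[:n].tobytes())


fit = {"seed": 1234, "fit_range": (2, 12), "p0": [1.0, 0.5]}  # hypothetical
grown = dict(fit, note="x" * 200)
shrunk = {"seed": 1}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "fit_pickle_demo.h5")
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "data_1")
        store = h5.create_earray(group, "fit", atom=tables.UInt8Atom(), shape=(0,))

        save(store, fit)
        restored = load(store)

        save(store, grown)                # pickle got longer: array expands
        restored_grown = load(store)

        save(store, shrunk)               # pickle got shorter: array keeps its size
        restored_shrunk = load(store)
        rows_on_disk = int(store.nrows)
        num_bytes = int(store.attrs.num_bytes)
```

After the final `save`, the array on disk is longer than the live pickle, which is exactly the case the `num_bytes` attribute is there to handle. As with any pickle, the stored bytes should only be loaded from files you trust.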
Re: [Pytables-users] writing metadata
Also, depending on how much metadata you really need to store, you could just use attributes. That is what they are there for.

On Tue, Jun 25, 2013 at 10:06 AM, Josh Ayers josh.ay...@gmail.com wrote:
> Another option is to create a Python object - dict, list, or whatever works - containing the metadata and then store a pickled version of it in a PyTables array.
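As a rough illustration of the attribute route suggested in this reply, the sketch below writes a small dict of fit settings as attributes on the data_1 group and reads them back. The setting names and values are hypothetical, and attaching them to the group (rather than to one of the arrays) is an assumption about where the metadata belongs.

```python
import os
import tempfile

import tables

# Hypothetical fit settings; only simple scalar and string values are used here.
fit_settings = {
    "seed": 1234,
    "fit_min": 2.0,
    "fit_max": 12.5,
    "method": "least_squares",
}

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "fit_attrs_demo.h5")
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "data_1")

        # Each setting becomes one HDF5 attribute on the group.
        for name, value in fit_settings.items():
            setattr(group._v_attrs, name, value)

        # _f_list("user") lists user-defined attributes only,
        # skipping the system attributes PyTables adds itself.
        names = group._v_attrs._f_list("user")
        restored = {name: getattr(group._v_attrs, name) for name in names}
```

The values come back as NumPy scalars rather than plain Python types, which compare equal to the originals but may need converting with `int()`, `float()` or `str()` before further use.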
Re: [Pytables-users] Speed of in-kernel Full-Table Search
Hi Sebastian,

On 25/06/2013 09:36, Wagner Sebastian wrote:
> The first reference I read to learn PyTables was Hints for SQL Users [1], which states several times, for example in the section 'Creating an index': Indexing is supported in the commercial version of PyTables (PyTablesPro). I would suggest updating these texts.
>
> [1] http://www.pytables.org/moin/HintsForSQLUsers#Creatinganindex
>     http://www.pytables.org/moin/HintsForSQLUsers#Selectingdata

Thank you for reporting the issue, I will fix it ASAP. The same problem also affects the corresponding cookbook page [1]. Anyway, please feel free to update the wiki if you find outdated material.

[1] http://pytables.github.io/cookbook/hints_for_sql_users.html

--
Antonio Valentino
Re: [Pytables-users] writing metadata
Hi Andreas, Josh, Anthony and Antonio,

Thanks for your help.

Andre

On Jun 26, 2013, at 2:48 AM, Antonio Valentino wrote:
> Hi Andre',
>
> On 25/06/2013 10:26, Andre' Walker-Loud wrote:
>> In my case, I won't have a table, but really just want a single object containing my metadata. I am wondering if there is a recommended way to do this? The Table does not seem optimal, but I don't see what else I would use.
>
> For leaf nodes (Tables, Arrays, etc.) you can use the attrs attribute set [1] as described in [2]. For group objects (e.g. root) you can use the set_node_attr method [3] of File objects, or _v_attrs.
>
> cheers
>
> [1] http://pytables.github.io/usersguide/libref/declarative_classes.html#attributesetclassdescr
> [2] http://pytables.github.io/usersguide/tutorials.html#setting-and-getting-user-attributes
> [3] http://pytables.github.io/usersguide/libref/file_class.html#tables.File.set_node_attr
>
> --
> Antonio Valentino
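The three access routes named in Antonio's reply (`attrs` on a leaf, `File.set_node_attr` by path, and `_v_attrs` on a group) can be shown side by side. This is a small sketch with hypothetical attribute names and values; the node paths follow the layout from the original question, with `/data_1` standing in for `/root/data_1`.

```python
import os
import tempfile

import numpy as np
import tables

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "node_attrs_demo.h5")
    with tables.open_file(path, mode="w") as h5:
        group = h5.create_group("/", "data_1")
        stat = h5.create_array(group, "stat", np.zeros(4))

        # 1. Leaf node: use its .attrs attribute set directly.
        stat.attrs.fit_range = np.array([2.0, 12.5])

        # 2. Any node, addressed by path through the File object.
        h5.set_node_attr("/data_1", "seed", 1234)

        # 3. Group node: use its _v_attrs attribute set.
        group._v_attrs.method = "least_squares"

        # Reading back works through the same three routes.
        fit_range = stat.attrs.fit_range.tolist()
        seed = int(h5.get_node_attr("/data_1", "seed"))
        method = str(group._v_attrs.method)
```

Routes 2 and 3 write to the same attribute set, so `seed` and `method` both end up on the data_1 group; which one to use is mostly a matter of whether you already hold the node object or only its path.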