Hi Peter,

> Kim, how are you? Out of curiosity, I would like to clarify something.

I am French and I work in IT.

I am working on a parallel implementation of CRF
(http://en.wikipedia.org/wiki/Conditional_random_field) learning (just
for fun).

> My naïve impression was that, if your keys are French words, the dataset
> can't be very large.
> Let's say that a highly educated English speaker has about 300,000 words in
> his/her vocabulary.
> Let's give a French speaker 500,000 words. Add two integers (8 bytes per
> entry) as payload.

In this hash table I have one entry per feature in my corpus, and my
corpus is huge: an extract of the French Wikipedia.
The example in my last email was not a good one.
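
To give an idea, the plain in-memory version is essentially just a map
from feature strings to their payload. This is only an illustration:
the feature key format and the two-long payload below are assumptions
based on your description, not my real feature values.

    import java.util.HashMap;
    import java.util.Map;

    public class InMemoryFeatures {
        public static void main(String[] args) {
            // One entry per feature extracted from the corpus; the two-long
            // payload and the key format are placeholders for illustration.
            Map<String, long[]> features = new HashMap<String, long[]>();
            features.put("w=maison|t=NOUN", new long[] { 42L, 7L }); // hypothetical key
            long[] payload = features.get("w=maison|t=NOUN");
            System.out.println(payload[0] + " " + payload[1]);
        }
    }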

> in-memory hashtable/dictionary lookup (even with a generic hash function
> that doesn't speak French).
> Why can't you hold that in memory?

My original implementation used an in-memory hash table, but this table
needed more than my 12 GB of memory.
My second implementation used a cached hash table with Ehcache; with
this cache I can control the memory usage, but the read performance was
very bad. The learning takes more than 10 days with two cluster nodes.
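
Roughly, the cached lookup looks like the sketch below (simplified, using
the Ehcache 2.x API; the cache name "features" and the long[] payload are
placeholders, not my actual code, and the real memory/disk limits would
live in ehcache.xml):

    import net.sf.ehcache.Cache;
    import net.sf.ehcache.CacheManager;
    import net.sf.ehcache.Element;

    public class CachedFeatureTable {
        private final Cache cache;

        public CachedFeatureTable() {
            // Assumes a cache named "features" declared in ehcache.xml with
            // maxElementsInMemory and overflowToDisk settings (placeholder).
            CacheManager manager = CacheManager.create();
            this.cache = manager.getCache("features");
        }

        public void add(String feature, long[] payload) {
            cache.put(new Element(feature, payload));
        }

        public long[] getValueByKey(String feature) {
            Element e = cache.get(feature);
            return e == null ? null : (long[]) e.getObjectValue();
        }
    }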


> Is your question that you don't want to regenerate that hashtable all the
> time,
> and that's why you'd like to store it on disk in HDF5?
> (Again, HDF5 has no DBMS like query engine and I don't see why you'd need
> that.)


My issue is not serialization; it is reading this hash table when it is
too large to fit in memory.
I tried a DBMS, but the performance was even worse than the cached hash
table; an index on the word column is not selective enough.

I hope that a binary file will help me.

I want a Java object with an add and a getValueByKey method. The most
important thing for me is read performance: this structure will be read
for a very long time (more than 10 days with the in-memory hash table).
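
To make it concrete, here is the kind of read-only access I am imagining
over a binary file. This is only a sketch of the idea, not something I
have implemented: the 24-byte record layout, the FNV-1a hash, and the
file sorted by key hash are all my assumptions.

    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;

    /**
     * Illustration only: fixed 24-byte records (8-byte key hash + two
     * 8-byte payload integers), sorted by key hash and searched with
     * binary search on a read-only memory mapping, so the OS page cache
     * keeps the hot part of the table in RAM.
     *
     * Simplifications: only a 64-bit hash of each feature is stored, so
     * two features could in theory collide (a real store would also keep
     * the keys to verify a match), and one MappedByteBuffer can map at
     * most ~2 GB, so a larger file would need several mappings.
     */
    public class MappedFeatureReader {
        private static final int RECORD_SIZE = 24;

        private final MappedByteBuffer map;
        private final long recordCount;

        public MappedFeatureReader(String file) throws IOException {
            RandomAccessFile raf = new RandomAccessFile(file, "r");
            try {
                FileChannel channel = raf.getChannel();
                long size = channel.size();
                this.map = channel.map(FileChannel.MapMode.READ_ONLY, 0, size);
                this.recordCount = size / RECORD_SIZE;
            } finally {
                raf.close();
            }
        }

        /** Returns {value1, value2} for the feature, or null if absent. */
        public long[] getValueByKey(String feature) {
            long wanted = hash64(feature);
            long lo = 0, hi = recordCount - 1;
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                int offset = (int) (mid * RECORD_SIZE);
                long h = map.getLong(offset);
                if (h < wanted) {
                    lo = mid + 1;
                } else if (h > wanted) {
                    hi = mid - 1;
                } else {
                    return new long[] { map.getLong(offset + 8),
                                        map.getLong(offset + 16) };
                }
            }
            return null;
        }

        /** FNV-1a 64-bit hash of the characters of the key. */
        private static long hash64(String s) {
            long h = 0xcbf29ce484222325L;
            for (int i = 0; i < s.length(); i++) {
                h ^= s.charAt(i);
                h *= 0x100000001b3L;
            }
            return h;
        }
    }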

Thanks for your help, and please excuse my bad English.

Regards



