Not sure if this is always ideal for Naive Bayes, but you could also hash the
features into a lower-dimensional space (e.g. reduce them to 50,000 features).
For each feature, simply take MurmurHash3(featureID) % 50000, for example.
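A minimal sketch of the hashing trick described above, in Python. MurmurHash3 is not in the standard library, so `zlib.crc32` stands in for it here; the bucket count and the `{featureID: value}` dict representation are assumptions for illustration:

```python
import zlib

NUM_BUCKETS = 50_000  # target dimensionality from the example above


def hash_features(features):
    """Map a sparse {featureID: value} dict into NUM_BUCKETS buckets.

    Colliding features are summed, the usual convention for the
    hashing trick. zlib.crc32 stands in for MurmurHash3 here.
    """
    hashed = {}
    for feature_id, value in features.items():
        bucket = zlib.crc32(str(feature_id).encode()) % NUM_BUCKETS
        hashed[bucket] = hashed.get(bucket, 0.0) + value
    return hashed


# Example: a sparse document vector with three non-zero features.
doc = {"word_foo": 1.0, "word_bar": 2.0, "word_baz": 1.0}
print(hash_features(doc))
```

Collisions lose a little information, but for a 2-million-feature, 20-50-nonzeros-per-example problem the chance of two active features colliding in any one example is small.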
Matei
On Apr 27, 2014, at 11:24 PM, DB Tsai wrote:
A year ago, our customer asked us to implement a Naive Bayes that could at
least train news20, and we implemented it for them in Hadoop, using the
distributed cache to store the model.
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
How big is your problem and how many labels? -Xiangrui
On Sun, Apr 27, 2014 at 10:28 PM, DB Tsai wrote:
Hi Xiangrui,
We also ran into this issue at Alpine Data Labs. We ended up using an LRU
cache to store the counts, and spilling the least-used counts to the
distributed cache in HDFS.
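A toy sketch of the scheme described above, assuming Python: an `OrderedDict` plays the bounded LRU cache, and a plain dict stands in for the HDFS-backed spill store (the class name and capacity are hypothetical):

```python
from collections import OrderedDict


class SpillingCounts:
    """Keep hot counts in a bounded LRU cache; evict cold ones to a
    spill store (a dict here, standing in for HDFS)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # feature -> count (hot, LRU-ordered)
        self.spilled = {}           # feature -> count (cold)

    def increment(self, feature, by=1):
        if feature in self.cache:
            self.cache.move_to_end(feature)  # mark as recently used
            self.cache[feature] += by
        else:
            # Pull any previously spilled count back in before updating.
            self.cache[feature] = self.spilled.pop(feature, 0) + by
            if len(self.cache) > self.capacity:
                # Evict the least-recently-used entry to the spill store.
                cold_feature, cold_count = self.cache.popitem(last=False)
                self.spilled[cold_feature] = cold_count

    def count(self, feature):
        return self.cache.get(feature, 0) + self.spilled.get(feature, 0)


counts = SpillingCounts(capacity=2)
for f in ["a", "b", "a", "c", "a"]:
    counts.increment(f)
print(counts.count("a"), counts.count("b"), counts.count("c"))
```

The real system would of course batch the spills rather than write single counts, but the eviction logic is the same idea.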
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
Even if the features are sparse, the conditional probabilities are stored
in a dense matrix. With 200 labels and 2 million features, you need to
store at least 4e8 doubles on the driver node. With multiple
partitions, you may need more memory on the driver. Could you try
reducing the number of partitions?
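Spelling out the arithmetic behind that estimate (a back-of-envelope check, assuming 8-byte doubles):

```python
labels = 200
features = 2_000_000

entries = labels * features        # cells in the dense labels-by-features matrix
bytes_needed = entries * 8         # 8 bytes per double

print(entries, bytes_needed / 1e9)  # 400000000 entries, 3.2 GB
```

So the conditional-probability matrix alone is about 3.2 GB before any per-partition copies, which explains running out of memory on a 30 GB driver once aggregation buffers multiply it.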
I'm already using the SparseVector class.
~200 labels
On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng wrote:
> How many labels does your dataset have? -Xiangrui
How many labels does your dataset have? -Xiangrui
On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote:
Which version of mllib are you using? For Spark 1.0, mllib will
support sparse feature vectors, which will improve performance a lot
when computing the distance between points and centroids.
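A sketch of why sparsity helps in that distance computation, assuming Python with a `{index: value}` dict in place of MLlib's SparseVector (the function name and data are hypothetical): the squared distance to a dense centroid can be computed touching only the point's non-zero entries, via ||x - c||^2 = ||c||^2 + sum over non-zero i of ((x_i - c_i)^2 - c_i^2).

```python
def sq_dist_sparse(point, centroid, centroid_sq_norm):
    """Squared Euclidean distance between a sparse point ({index: value})
    and a dense centroid (list), visiting only the point's non-zeros.

    Uses ||x - c||^2 = ||c||^2 + sum_i in nnz(x) [(x_i - c_i)^2 - c_i^2],
    where ||c||^2 is precomputed once per centroid.
    """
    total = centroid_sq_norm
    for i, v in point.items():
        total += (v - centroid[i]) ** 2 - centroid[i] ** 2
    return total


centroid = [0.5, 0.0, 1.0, 0.25]
sq_norm = sum(c * c for c in centroid)  # precomputed once per centroid
point = {0: 1.0, 2: 2.0}  # sparse: only 2 of 4 entries are non-zero

print(sq_dist_sparse(point, centroid, sq_norm))
```

With 2 million features but only 20-50 non-zeros per point, this turns each distance evaluation from millions of operations into a few dozen.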
Sincerely,
DB Tsai
---
My Blog: https://www.dbtsai.com
I'm just wondering: do the SparseVector calculations really take the
sparsity into account, or do they just convert to dense?
On Fri, Apr 25, 2014 at 10:06 PM, John King wrote:
I've been trying to use the Naive Bayes classifier. Each example in the
dataset has about 2 million features, only about 20-50 of which are
non-zero, so the vectors are very sparse. I keep running out of memory,
though, even for about 1000 examples on 30 GB of RAM, while the entire
dataset is 4 million examples.