[
https://issues.apache.org/jira/browse/MAHOUT-122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12725125#action_12725125
]
Deneche A. Hakim commented on MAHOUT-122:
-----------------------------------------
bq. The 450 byte overhead per training instance seems a little bit high, but I
don't know the data well so it might be pretty reasonable. The original data
size was about 100 bytes.
I may be able to explain this overhead:
* First of all, the memory estimations that I've done didn't account for
memory not yet garbage collected, so I ran the tests again and this time I
launched the Garbage Collector just after loading the data;
* In a separate run, I allocated a double[nb instances][nb attributes] and
noted how much memory it used.
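A minimal sketch of the kind of measurement described in the two points above. It is not the code used for the numbers in this comment: the class name DenseArrayFootprint and both helper methods are hypothetical, and the heap reading is only approximate because System.gc() is a hint that the JVM may ignore.

```java
final class DenseArrayFootprint {

    /** Heap currently in use, read after requesting a garbage collection. */
    static long usedHeapAfterGc() {
        Runtime rt = Runtime.getRuntime();
        System.gc(); // only a request; the JVM decides whether to collect
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Payload alone: one 8-byte double per attribute, no object overhead. */
    static long payloadBytes(int nbInstances, int nbAttributes) {
        return (long) nbInstances * nbAttributes * Double.BYTES;
    }

    public static void main(String[] args) {
        int nbInstances = 49_402; // KDD 1% row count reported in this comment
        int nbAttributes = 42;

        long before = usedHeapAfterGc();
        double[][] data = new double[nbInstances][nbAttributes];
        long after = usedHeapAfterGc();

        System.out.println("payload only: " + payloadBytes(nbInstances, nbAttributes) + " B");
        System.out.println("measured:     " + (after - before) + " B");
        // Reading data.length keeps the array reachable until after the measurement.
        System.out.println("rows kept:    " + data.length);
    }
}
```

The measured figure is expected to exceed the payload, since every double[] row is an object of its own with a header, and the outer array holds one reference per row. The exact sizes depend on the JVM and its settings.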
|| Dataset || Data size (nb instances x nb attributes) || Mem. used by double[nb instances][nb attributes] || Mem. used after loading the data (MUALD) ||
| KDD 1% | 49.402 x 42 | 19.050.312 B | 22.331.360 B |
| KDD 10% | 494.021 x 42 | 178.094.200 B | 204.659.576 B |
| KDD 25% | 1.224.607 x 42 | 438.395.224 B | 500.341.256 B |
| KDD 50% | 2.449.215 x 42 | 873.266.456 B | 998.331.560 B |
Most of the overhead is caused by how the instances are represented in memory.
I'm using a DenseVector, so all the attributes are stored in a double[], which
means that each attribute uses 8 B of memory. By examining the original data,
we can see that most of the attributes contain at most 3 digits, and because
they are stored as text they take at most 4 B if we count the separator.
I suppose that the difference between MUALD and the memory used by double[][]
is caused by the way the JVM stores the references to the instance objects.
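The per-instance figures implied by the table can be worked out with a short sketch like the one below. The class name PerInstanceOverhead and its methods are hypothetical; the totals are copied from the table, and the 4 B per text attribute is the rough upper bound given above for most attributes, not a measured value.

```java
final class PerInstanceOverhead {

    static final int NB_ATTRIBUTES = 42;

    /** Average bytes per instance for a measured total. */
    static double bytesPerInstance(long totalBytes, long nbInstances) {
        return (double) totalBytes / nbInstances;
    }

    /** 42 attributes stored as 8-byte doubles, without any object overhead. */
    static long rawDoubleBytesPerInstance() {
        return (long) NB_ATTRIBUTES * Double.BYTES;
    }

    /** Rough bound for the text form: at most 4 B per attribute, separator included. */
    static long maxTextBytesPerInstance() {
        return NB_ATTRIBUTES * 4L;
    }

    public static void main(String[] args) {
        String[] names = {"KDD 1%", "KDD 10%", "KDD 25%", "KDD 50%"};
        long[] nbInstances = {49_402L, 494_021L, 1_224_607L, 2_449_215L};
        long[] arrayBytes = {19_050_312L, 178_094_200L, 438_395_224L, 873_266_456L};
        long[] muald = {22_331_360L, 204_659_576L, 500_341_256L, 998_331_560L};

        System.out.println("raw doubles per instance: " + rawDoubleBytesPerInstance() + " B");
        System.out.println("text bound per instance:  " + maxTextBytesPerInstance() + " B");
        for (int i = 0; i < names.length; i++) {
            System.out.printf("%-8s double[][]: %.1f B/instance, MUALD: %.1f B/instance%n",
                    names[i],
                    bytesPerInstance(arrayBytes[i], nbInstances[i]),
                    bytesPerInstance(muald[i], nbInstances[i]));
        }
    }
}
```

For KDD 1% this gives roughly 452 B per instance after loading, in line with the 450 B figure quoted at the top, against 336 B for the raw doubles alone.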
> Random Forests Reference Implementation
> ---------------------------------------
>
> Key: MAHOUT-122
> URL: https://issues.apache.org/jira/browse/MAHOUT-122
> Project: Mahout
> Issue Type: Task
> Components: Classification
> Affects Versions: 0.2
> Reporter: Deneche A. Hakim
> Attachments: 2w_patch.diff, 3w_patch.diff, RF reference.patch
>
> Original Estimate: 25h
> Remaining Estimate: 25h
>
> This is the first step of my GSOC project. Implement a simple, easy to
> understand, reference implementation of Random Forests (Building and
> Classification). The only requirement here is that "it works"