GitHub user etrain commented on the pull request:
https://github.com/apache/spark/pull/79#issuecomment-39392123
Hi Hirakendu - thanks for all the detailed suggestions and information. I
will reply to that separately.
One question - you say there are 500,000 examples and that this equates to
90GB of raw data. If that's the case, it works out to ~200KB per example. Is
that right, or are you off by an order of magnitude in either the number of
features or the number of data points? Or are we throwing a bunch of data out
before fitting?
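
For reference, here is a quick sanity check of that per-example figure. It is only a rough sketch: it assumes decimal units (1 GB = 10^9 bytes) and, purely for illustration, that each feature is stored as an 8-byte double. Neither assumption is stated in the thread, and the helper name is hypothetical.

```python
def bytes_per_example(total_bytes: float, num_examples: int) -> float:
    """Average raw size of one example, in bytes."""
    return total_bytes / num_examples


GB = 10**9  # assumption: decimal gigabytes (the thread does not say GB vs GiB)
KB = 10**3  # assumption: decimal kilobytes

avg_bytes = bytes_per_example(90 * GB, 500_000)
per_example_kb = avg_bytes / KB

# Assumption for illustration only: dense features stored as 8-byte doubles.
features_per_example = avg_bytes / 8

print(f"{per_example_kb:.0f} KB per example")
print(f"{features_per_example:.0f} doubles per example")
```

Under those assumptions the average is about 180KB per example, which is in line with the ~200KB estimate, so the figures are only plausible if each example really carries tens of thousands of values.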