Hi,

I am trying to validate our modeling data pipeline by running
LogisticRegressionWithLBFGS on a dataset with ~3.7 million features,
basically to compute AUC. This is on Spark 1.3.0.

I am using 128 executors with 4 GB each, plus a driver with 8 GB. The data
is split into 3072 partitions.

The execution fails with the following message:

*Total size of serialized results of 54 tasks (10.4 GB) is bigger than
spark.driver.maxResultSize (3.0 GB)*

The associated stage in the job is treeAggregate at StandardScaler.scala:52
<http://lsv-10.rfiserve.net:18080/history/application_1426202183036_633264/stages/stage?id=3&attempt=0>.
The call stack looks like this:

org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:996)
org.apache.spark.mllib.feature.StandardScaler.fit(StandardScaler.scala:52)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:233)
org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:190)
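
For reference, here is a stripped-down sketch of the kind of driver code that
hits this path (the input path, split, and parsing are placeholders, not our
actual pipeline). As far as I can tell from the stack, LogisticRegressionWithLBFGS
turns on feature scaling internally, which is what brings StandardScaler.fit
and its treeAggregate into the picture:

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.util.MLUtils

// Minimal sketch, not our real pipeline: load a sparse dataset with ~3.7M
// features, train with the stock LBFGS logistic regression, and compute AUC.
// (sc is the SparkContext created by our launcher.)
val raw = MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/data")  // placeholder path
val data = raw.repartition(3072)
val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(train)  // fails in the StandardScaler.fit treeAggregate stage

model.clearThreshold()  // return raw scores rather than 0/1 predictions
val scoreAndLabels = test.map(p => (model.predict(p.features), p.label))
val auc = new BinaryClassificationMetrics(scoreAndLabels).areaUnderROC()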


I am trying both to understand why such a large amount of data needs to be
passed back to the driver and to figure out a way around it. For future
experiments, I also want to understand how much memory is required as a
function of dataset size, feature count, and number of iterations.

From looking at the MLlib code, the largest data structure seems to be a
dense vector of the same size as the feature set. I am not familiar with the
algorithm or its implementation, but I would guess 3.7 million features
leads to a small constant multiple of ~3.7 million * 8 bytes ~ 30 MB per
vector. So how does the serialized result size become so large?
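
Working backwards from the numbers in the error message (rough arithmetic,
could easily be off):

// Back-of-the-envelope numbers taken from the error message above.
val numFeatures = 3.7e6
val bytesPerDenseVector = numFeatures * 8                      // ~30 MB, as estimated above
val perTaskResult = 10.4e9 / 54                                // ~190 MB per serialized task result
val denseVectorsPerTask = perTaskResult / bytesPerDenseVector  // roughly 6-7

So each task result looks like it carries roughly half a dozen feature-length
dense arrays rather than one. My unverified guess is that the summary object
aggregated by StandardScaler.fit keeps several arrays of length numFeatures
(means, variances, counts, min/max, and so on), which would explain the
multiple.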

I looked into treeAggregate, and it appears to do hierarchical aggregation.
If the data being sent to the driver is basically the partially aggregated
coefficients (i.e. dense vectors) awaiting the final merge, couldn't the
dense vectors from the executors be pulled in one at a time and merged in
memory, rather than all of them at once? (This is a totally uneducated
guess, so I may be completely off here.)
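
For reference, this is my reading of the treeAggregate call pattern, as a toy
sketch with small dimensions (paraphrasing the RDD API, not the actual MLlib
code):

import org.apache.spark.SparkContext

// Toy sketch of the treeAggregate shape used by StandardScaler.fit. Each
// partition folds its records into one dense array (seqOp), arrays are then
// merged pairwise across `depth` levels (combOp), and, as far as I can tell,
// only the last level's partial results are shipped to the driver for the
// final merge.
def sumFeatures(sc: SparkContext, numFeatures: Int): Array[Double] = {
  val records = sc.parallelize(1 to 100000, 3072)  // 3072 partitions, as in our job
  val data = records.map(_ => Array.fill(numFeatures)(1.0))  // stand-in per-record updates
  data.treeAggregate(new Array[Double](numFeatures))(
    seqOp = (acc, v) => { for (i <- acc.indices) acc(i) += v(i); acc },
    combOp = (a, b) => { for (i <- a.indices) a(i) += b(i); a },
    depth = 2  // the MLlib call appears to use the default depth of 2
  )
}

If that reading is right, then with 3072 partitions and depth 2 the final
level has on the order of sqrt(3072) ~ 55 partial results, which lines up
suspiciously well with the "54 tasks" in the error message.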

Is there a way to get this running?
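
In the meantime, the only blunt workaround I can think of is raising the
limits and hoping the driver can actually hold and merge that much (sketch
below; the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Blunt workaround I'm considering: allow the ~10.4 GB of partial aggregates
// through by raising spark.driver.maxResultSize, and give the driver a much
// larger heap to merge them. The driver heap itself has to be set at launch
// time (e.g. spark-submit --driver-memory 16g), not from code.
val conf = new SparkConf()
  .setAppName("lr-auc-validation")           // placeholder name
  .set("spark.driver.maxResultSize", "12g")  // we currently cap this at 3g
val sc = new SparkContext(conf)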

Thanks,
pala
