[
https://issues.apache.org/jira/browse/SPARK-7568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14541374#comment-14541374
]
DB Tsai edited comment on SPARK-7568 at 5/13/15 5:37 AM:
---------------------------------------------------------
In 1.3,
https://github.com/apache/spark/blob/branch-1.3/mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala
the intercept is set to false!
I just checked: in 1.4, before I introduced the new logistic regression (LOR)
implementation, we changed the intercept default to true. In that case,
instance 6 gets prediction 0.0.
I confirmed that if we turn off the intercept and keep the same regularization,
the probabilities are the same as in 1.3. Maybe we should just turn off the
intercept, since this is a high-dimensional problem and the intercept is not
important most of the time.
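
To make the effect of the intercept concrete, here is a minimal pure-Python
sketch (not Spark code). It assumes the standard binary logistic model,
P(label=1) = sigmoid(w.x + b), and that rawPrediction[0] in the issue output is
the negated margin. The `predict` helper and the split of a margin into a
feature part and an intercept part are hypothetical, chosen only for
illustration; only the rawPrediction values are taken from the issue.

```python
import math


def sigmoid(z):
    """Logistic function mapping a margin to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def predict(margin, intercept=0.0, threshold=0.5):
    """Return (P(label=1), predicted label) for a binary logistic model.

    `margin` is the feature part w.x only; the intercept is added here.
    """
    p1 = sigmoid(margin + intercept)
    return p1, (1.0 if p1 > threshold else 0.0)


# rawPrediction[0] per row id, copied from the issue description.
reported_raw0 = {4: 0.1629, 5: 2.6407, 6: 1.2651, 7: 3.7429}

for row_id, raw0 in reported_raw0.items():
    # The class-1 margin is assumed to be the negated class-0 raw prediction.
    p1, label = predict(-raw0)
    print(row_id, "prob=[%.4f, %.4f]" % (1.0 - p1, p1), "prediction=%s" % label)

# Hypothetical numbers (NOT from the issue): a positive feature margin is
# outweighed by a negative intercept, which flips the prediction from 1.0 to 0.0.
feature_margin = 0.5
print(predict(feature_margin))                     # intercept off
print(predict(feature_margin, intercept=-1.7651))  # intercept on
```

The loop reproduces the reported probabilities from the raw predictions, so the
0.0 labels are consistent with the model that was fitted; the question is which
model gets fitted. In spark.ml's LogisticRegression the corresponding switch is
the `fitIntercept` param.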
> ml.LogisticRegression doesn't output the right prediction
> ---------------------------------------------------------
>
> Key: SPARK-7568
> URL: https://issues.apache.org/jira/browse/SPARK-7568
> Project: Spark
> Issue Type: Bug
> Components: ML
> Affects Versions: 1.4.0
> Reporter: Xiangrui Meng
> Assignee: DB Tsai
> Priority: Blocker
>
> `bin/spark-submit
> examples/src/main/python/ml/simple_text_classification_pipeline.py`
> {code}
> Row(id=4, text=u'spark i j k', words=[u'spark', u'i', u'j', u'k'],
> features=SparseVector(262144, {105: 1.0, 106: 1.0, 107: 1.0, 62173: 1.0}),
> rawPrediction=DenseVector([0.1629, -0.1629]),
> probability=DenseVector([0.5406, 0.4594]), prediction=0.0)
> Row(id=5, text=u'l m n', words=[u'l', u'm', u'n'],
> features=SparseVector(262144, {108: 1.0, 109: 1.0, 110: 1.0}),
> rawPrediction=DenseVector([2.6407, -2.6407]),
> probability=DenseVector([0.9334, 0.0666]), prediction=0.0)
> Row(id=6, text=u'mapreduce spark', words=[u'mapreduce', u'spark'],
> features=SparseVector(262144, {62173: 1.0, 140738: 1.0}),
> rawPrediction=DenseVector([1.2651, -1.2651]),
> probability=DenseVector([0.7799, 0.2201]), prediction=0.0)
> Row(id=7, text=u'apache hadoop', words=[u'apache', u'hadoop'],
> features=SparseVector(262144, {128334: 1.0, 134181: 1.0}),
> rawPrediction=DenseVector([3.7429, -3.7429]),
> probability=DenseVector([0.9769, 0.0231]), prediction=0.0)
> {code}
> In Scala
> {code}
> $ bin/run-example ml.SimpleTextClassificationPipeline
> (4, spark i j k) --> prob=[0.5406433544851436,0.45935664551485655],
> prediction=0.0
> (5, l m n) --> prob=[0.9334382627383263,0.06656173726167364], prediction=0.0
> (6, mapreduce spark) --> prob=[0.7799076868203896,0.22009231317961045],
> prediction=0.0
> (7, apache hadoop) --> prob=[0.9768636139518304,0.023136386048169616],
> prediction=0.0
> {code}
> All predictions are 0, while some should be one based on the probability. It
> seems to be an issue with regularization.