[ https://issues.apache.org/jira/browse/SPARK-10870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15110060#comment-15110060 ]
Yu Ishikawa commented on SPARK-10870:
-------------------------------------

[~prudenko] Should we test the Kaggle data with the winning solution that used a GBDT encoder?

> Criteo Display Advertising Challenge
> ------------------------------------
>
>                 Key: SPARK-10870
>                 URL: https://issues.apache.org/jira/browse/SPARK-10870
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>            Reporter: Peter Rudenko
>
> A very useful dataset for testing pipelines because:
> # "Big data" scale: the original Kaggle competition dataset is 12 GB, and there is also a [1 TB|http://labs.criteo.com/downloads/download-terabyte-click-logs/] dataset with the same schema.
> # Sparse models: the categorical features have high cardinality.
> # Reproducible results: the data is public, and many other distributed machine learning libraries (e.g. [wormhole|https://github.com/dmlc/wormhole/blob/master/doc/tutorial/criteo_kaggle.rst], [parameter server|https://github.com/dmlc/parameter_server/blob/master/example/linear/criteo/README.md], [azure ml|https://azure.microsoft.com/en-us/documentation/articles/machine-learning-data-science-process-hive-criteo-walkthrough/#mltasks], etc.) have published baseline benchmarks against which we can compare.
> I have some baseline results with custom models (GBDT encoders and a tuned LR) on spark-1.4. I will build pipelines using the public Spark models. The [winning solution|http://www.csie.ntu.edu.tw/~r01922136/kaggle-2014-criteo.pdf] used a GBDT encoder (not available in Spark, but not difficult to build from the GBT in MLlib) + hashing + a factorization machine (planned for spark-1.6).

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
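For readers unfamiliar with the GBDT-encoder trick mentioned above: each sample is passed through the boosted trees, the index of the leaf it lands in per tree is treated as a categorical feature, and those (tree, leaf) pairs are hashed into a sparse vector for the downstream linear model or factorization machine. A minimal, self-contained sketch of the idea in plain Python (the toy trees, field names, and `NUM_BUCKETS` are all illustrative assumptions, not the Spark or winning-solution code):

```python
# Sketch of GBDT leaf encoding + the hashing trick, as used in the
# winning Criteo solution. Everything here is a toy illustration.

NUM_BUCKETS = 2 ** 6  # size of the hashed feature space (illustrative)

def gbdt_encode(sample, trees):
    """Return the leaf index the sample reaches in each tree."""
    return [tree(sample) for tree in trees]

def hash_features(leaf_ids):
    """Hashing trick: map each (tree position, leaf id) pair to a bucket."""
    return sorted({hash((i, leaf)) % NUM_BUCKETS
                   for i, leaf in enumerate(leaf_ids)})

# Toy "trees": each is just a single split returning a leaf id (0 or 1).
# A real GBDT encoder would use the fitted trees from GBT in MLlib.
trees = [
    lambda s: 0 if s["clicks"] < 5 else 1,
    lambda s: 0 if s["site"] == "news" else 1,
]

sample = {"clicks": 7, "site": "shop"}
leaf_ids = gbdt_encode(sample, trees)  # [1, 1]
buckets = hash_features(leaf_ids)      # active indices in the sparse vector
```

The `buckets` indices would then be the non-zero coordinates of the sparse input fed to the logistic regression or factorization machine.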