Github user takuti commented on the issue:
https://github.com/apache/incubator-hivemall/pull/149
The evaluation has been conducted at
[takuti/criteo-ffm](https://github.com/takuti/criteo-ffm). See the repository
for details.
As an example, I have used the tiny data provided at
[guestwalk/kaggle-2014-criteo](https://github.com/guestwalk/kaggle-2014-criteo),
which is already preprocessed and converted into the LIBFFM format (a rough sketch of the Hive-side table layout follows the list):
- Split 2,000 samples in `train.tiny.csv` into:
  - 1,587 training samples (`tr.sp`)
  - 412 validation samples (`va.sp`)
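On the Hive side, `train_ffm` takes the features as an `array<string>` whose elements keep the LIBFFM-style `<field>:<index>:<value>` notation. The DDL below is only a rough sketch of that layout; the actual table definition and conversion scripts are in takuti/criteo-ffm:
```
-- Rough sketch of the Hive-side input table (see takuti/criteo-ffm for the real DDL)
CREATE TABLE IF NOT EXISTS criteo.train_vectorized (
  label int COMMENT '0/1 click label',
  features array<string> COMMENT 'LIBFFM-style "<field>:<index>:<value>" strings'
);
```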
As a result, the FFM models created by LIBFFM and Hivemall under the following
(nearly identical) configurations showed very similar training loss and
accuracy:
**LIBFFM**:
```
$ ./ffm-train -k 4 -t 15 -l 0.00002 -r 0.2 -s 10 ../tr.sp model
iter tr_logloss tr_time
1 1.04980 0.0
2 0.53771 0.0
3 0.50963 0.0
4 0.48980 0.1
5 0.47469 0.1
6 0.46304 0.1
7 0.45289 0.1
8 0.44400 0.1
9 0.43653 0.1
10 0.42947 0.1
11 0.42330 0.1
12 0.41727 0.1
13 0.41130 0.1
14 0.40558 0.1
15 0.40036 0.1
```
> LogLoss on the validation set `va.sp`: 0.47237
**Hivemall**:
```
$ hive --hiveconf hive.root.logger=INFO,console
hive> INSERT OVERWRITE TABLE criteo.ffm_model
> SELECT
> train_ffm(features, label, '-init_v random -max_init_value 1.0 -classification -iterations 15 -factors 4 -eta 0.2 -l2norm -optimizer sgd -lambda 0.00002 -cv_rate 0.0 -disable_wi')
> FROM (
> SELECT
> features, label
> FROM
> criteo.train_vectorized
> CLUSTER BY rand(1)
> ) t
> ;
Record training examples to a file:
/var/folders/rg/6mhvj7h567x_ys7brmf2bb6w0000gn/T/hivemall_fm6211397472147242886.sgmt
Iteration #2 | average loss=0.5316043797079182, current cumulative loss=843.6561505964662, previous cumulative loss=1214.5909560888044, change rate=0.30539895232450376, #trainingExamples=1587
Iteration #3 | average loss=0.5065999656968238, current cumulative loss=803.9741455608594, previous cumulative loss=843.6561505964662, change rate=0.04703575622313853, #trainingExamples=1587
Iteration #4 | average loss=0.49634490612175397, current cumulative loss=787.6993660152235, previous cumulative loss=803.9741455608594, change rate=0.0202429140731664, #trainingExamples=1587
Iteration #5 | average loss=0.48804954980765963, current cumulative loss=774.5346355447558, previous cumulative loss=787.6993660152235, change rate=0.0167128869698916, #trainingExamples=1587
Iteration #6 | average loss=0.48072518575956447, current cumulative loss=762.9108698004288, previous cumulative loss=774.5346355447558, change rate=0.015007418920848658, #trainingExamples=1587
Iteration #7 | average loss=0.47402279755334875, current cumulative loss=752.2741797171644, previous cumulative loss=762.9108698004288, change rate=0.013942244768444403, #trainingExamples=1587
Iteration #8 | average loss=0.4677507471836629, current cumulative loss=742.320435780473, previous cumulative loss=752.2741797171644, change rate=0.013231537390308698, #trainingExamples=1587
Iteration #9 | average loss=0.4618142861358177, current cumulative loss=732.8992720975427, previous cumulative loss=742.320435780473, change rate=0.012691505216375798, #trainingExamples=1587
Iteration #10 | average loss=0.4561878517855827, current cumulative loss=723.9701207837197, previous cumulative loss=732.8992720975427, change rate=0.012183326759580433, #trainingExamples=1587
Iteration #11 | average loss=0.45087834343992406, current cumulative loss=715.5439310391595, previous cumulative loss=723.9701207837197, change rate=0.01163886395675921, #trainingExamples=1587
Iteration #12 | average loss=0.4458864402438874, current cumulative loss=707.6217806670493, previous cumulative loss=715.5439310391595, change rate=0.011071508021324606, #trainingExamples=1587
Iteration #13 | average loss=0.44118468270053807, current cumulative loss=700.1600914457539, previous cumulative loss=707.6217806670493, change rate=0.010544742156271002, #trainingExamples=1587
Iteration #14 | average loss=0.4367191822212713, current cumulative loss=693.0733421851576, previous cumulative loss=700.1600914457539, change rate=0.01012161268141256, #trainingExamples=1587
Iteration #15 | average loss=0.4324248854220929, current cumulative loss=686.2582931648615, previous cumulative loss=693.0733421851576, change rate=0.009833084906727563, #trainingExamples=1587
Performed 15 iterations of 1,587 training examples on memory (thus 23,805 training updates in total)
```
> LogLoss on the same validation set: 0.47604112308042346
Note that, since we used the `-l2norm` option for training, the validation
samples also need to be L2-normalized, e.g.
`feature_pairs(l2_normalize(t1.features), '-ffm')`.
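The validation LogLoss above was computed with a query roughly along the following lines. This is only a sketch: the `ffm_predict` call, the `ffm_model` column names, and the `criteo.test_vectorized`/`rowid` identifiers are placeholders here, and the exact query is in takuti/criteo-ffm.
```
-- Rough sketch of the evaluation; see takuti/criteo-ffm for the actual query.
WITH va_exploded AS (
  SELECT
    t1.rowid,
    t1.label,
    t2.i, t2.j, t2.Xi, t2.Xj
  FROM
    criteo.test_vectorized t1  -- validation samples (va.sp) loaded into Hive
    LATERAL VIEW feature_pairs(l2_normalize(t1.features), '-ffm') t2 AS i, j, Xi, Xj
),
predicted AS (
  SELECT
    t1.rowid,
    t1.label,
    -- placeholder prediction: join the exploded features with the model table
    -- and turn the raw score into a probability
    sigmoid(ffm_predict(p1.Wi, p1.Vifj, p2.Vjfi, t1.Xi, t1.Xj)) AS probability
  FROM
    va_exploded t1
    LEFT OUTER JOIN criteo.ffm_model p1 ON (t1.i = p1.i AND t1.j = p1.j)
    LEFT OUTER JOIN criteo.ffm_model p2 ON (t1.j = p2.i AND t1.i = p2.j)
  GROUP BY
    t1.rowid, t1.label
)
SELECT logloss(probability, label) AS validation_logloss
FROM predicted;
```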
While the choice of hyper-parameters and optimizer (SGD/FTRL/AdaGrad) affects
the accuracy to some degree, I have noticed that `-disable_wi` can be a more
important factor on this data. If we use the linear terms to train the FFM model,
LogLoss on `va.sp` increases significantly to `1.5227099483928919`.
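To make the difference explicit, with `-disable_wi` the model keeps only the field-aware interaction terms (which is, as far as I know, also what LIBFFM optimizes), whereas dropping the option adds the linear terms on top:
```
% with -disable_wi: field-aware interaction terms only
\hat{y}(\mathbf{x}) = \sum_{i < j} \langle \mathbf{v}_{i,f_j}, \mathbf{v}_{j,f_i} \rangle x_i x_j

% without -disable_wi: a linear term w_i x_i is added for each feature
\hat{y}(\mathbf{x}) = \sum_{i} w_i x_i + \sum_{i < j} \langle \mathbf{v}_{i,f_j}, \mathbf{v}_{j,f_i} \rangle x_i x_j
```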
I'm still not sure whether this result is expected or caused by a bug. Let me
double-check the implementation.
---