Hi,
I was trying to use NetflixDatasetConverter.java to prepare training/probing
data for ALSWR.
I have obtained the Netflix data, but when I ran the converter I got the
following exception:
Exception in thread "main" java.lang.IllegalStateException
    at com.google.common.base.Preconditions.checkState(Preconditions.java:161)
    at org.apache.mahout.cf.taste.hadoop.example.als.netflix.NetflixDatasetConverter.main(NetflixDatasetConverter.java:135)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:76)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:607)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:76)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:607)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
I have looked at the code and have some doubts.
I am not sure whether the Netflix data set uses different names for the same
file.
The data set I obtained has training_set (a directory containing 100M data
points), qualifying.txt (2.8M data points), and probe.txt (1.4M data
points).
According to the Netflix readme, training_set is a superset of probe.txt,
and qualifying.txt is used by contestants to submit their predictions (thus,
no ground truth is given for qualifying.txt).
There is no file called "judging.txt", as suggested by
NetflixDatasetConverter.java's help. I gambled on probe.txt being the
"judging.txt".
However, I got the above exception.
On a side note, the naming of the variable "probes" is a bit confusing to me,
as it is created by reading the file named "qualifying.txt",
while there is an actual file named probe.txt (at least from Netflix).
But what really matters is line 133:
    float rating = Float.parseFloat(SEPARATOR.split(line)[0]);
This implies that judging.txt should contain the actual rating for each
(user, movie) pair, which is not true of probe.txt (it doesn't contain such
rating information).
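To make that concrete, here is a minimal stand-alone sketch of what I think
happens at line 133. It is not Mahout code: the class name, the sample lines,
and the assumption that SEPARATOR is a comma pattern are all mine.

```java
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch, not Mahout code.
final class RatingParseSketch {

  // Assumption: the converter's SEPARATOR splits on commas.
  private static final Pattern SEPARATOR = Pattern.compile(",");

  // Mirrors the expression at line 133: the first field is read as the rating.
  static float firstFieldAsRating(String line) {
    return Float.parseFloat(SEPARATOR.split(line)[0]);
  }

  public static void main(String[] args) {
    // Made-up line in the shape line 133 seems to expect: rating first.
    System.out.println(firstFieldAsRating("3,2005-09-06"));
    // Made-up probe.txt-style line: only a customer ID, no rating.
    System.out.println(firstFieldAsRating("1046323"));
  }
}
```

If this reading is right, the second call does not fail: the customer ID is
simply parsed as if it were a rating, so this line alone would not explain
the exception.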
Also, lines 134-136:
    Preference pref = probes.get(ratingsProcessed);
    Preconditions.checkState(pref.getItemID() == currentMovieID);
    ratingsProcessed++;
seem to imply that qualifying.txt and judging.txt (probe.txt) have exactly
the same (user, movie) pairs in the same order, the only difference being
that judging.txt has the ratings and qualifying.txt doesn't.
This seems to go against what probe.txt contains and the fact that
probe.txt and qualifying.txt should not overlap.
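Here is a small sketch of the lockstep check as I understand it, which I
believe reproduces my exception. Again this is not Mahout code: the class and
method names are hypothetical, checkState is a plain-Java stand-in for Guava's
Preconditions.checkState, the "movieID:" header format is my assumption, and
the data is made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-alone sketch, not Mahout code.
final class AlignmentCheckSketch {

  // Plain-Java stand-in for Guava's Preconditions.checkState(boolean).
  static void checkState(boolean expression) {
    if (!expression) {
      throw new IllegalStateException();
    }
  }

  // Assumption: a line ending in ":" starts a new movie; every other line is
  // one entry for that movie. Returns the movie ID of each entry, in order.
  static List<Long> movieIdPerEntry(List<String> lines) {
    List<Long> ids = new ArrayList<>();
    long currentMovieID = -1;
    for (String line : lines) {
      if (line.endsWith(":")) {
        currentMovieID = Long.parseLong(line.substring(0, line.length() - 1));
      } else {
        ids.add(currentMovieID);
      }
    }
    return ids;
  }

  // Walks both files in lockstep, the way lines 134-136 appear to.
  static int countAligned(List<String> qualifying, List<String> judging) {
    List<Long> probes = movieIdPerEntry(qualifying);
    int ratingsProcessed = 0;
    for (long currentMovieID : movieIdPerEntry(judging)) {
      checkState(probes.get(ratingsProcessed) == currentMovieID);
      ratingsProcessed++;
    }
    return ratingsProcessed;
  }

  public static void main(String[] args) {
    List<String> qualifying = Arrays.asList("1:", "1046323", "2:", "1080030");
    // Same movies in the same order, with a rating per entry: passes.
    List<String> aligned = Arrays.asList("1:", "3", "2:", "4");
    System.out.println(countAligned(qualifying, aligned));
    // A probe.txt-like file covering different movies: the check fails.
    List<String> probeLike = Arrays.asList("1:", "30878", "10:", "1952305");
    try {
      countAligned(qualifying, probeLike);
    } catch (IllegalStateException e) {
      System.out.println("IllegalStateException, as in my stack trace");
    }
  }
}
```

So as far as I can tell, any file whose entries do not line up one-to-one with
qualifying.txt will trip the check, which is what I suspect happens with
probe.txt.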
So what is this "judging.txt" file that I am supposed to provide, and where
can I get it? Could anybody provide some pointers?
Thanks,
Wei