Hi.
I was trying to test the fpgrowth algorithm on a toy example and I kind of
fail
to get the correct results ... so it'd be nice if someone could take a look
at
my (admittedly very ad-hoc) code and tell me whether there's something
wrong with it. I'm running the mahout svn version.
I tried to run the following example (input):
1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5
With the following code:
public static void main(String[] args) {
int minSupport = 2;
int maxHeapSize = 100;
String input = "inputfile";
String output = "output";
FPGrowth<String> fp = new FPGrowth<String>();
FileSystem fs = new RawLocalFileSystem();
Configuration conf = new Configuration();
String pattern = "[\\ ]";
try {
fs = FileSystem.get(URI.create(output), conf);
SequenceFile.Writer writer = null;
writer = new SequenceFile.Writer(fs, conf, new Path(output),
Text.class, TopKStringPatterns.class);
Charset encoding = Charset.forName("UTF-8");
List<Pair<String, Long>> generateFList = null;
generateFList = fp.generateFList(new StringRecordIterator(new
FileLineIterable(new File(input), encoding, false),
pattern), minSupport);
StringRecordIterator transactions = null;
transactions = new StringRecordIterator(new FileLineIterable(new
File(input), encoding, false), pattern);
List<Text> keyList = new LinkedList<Text>();
List<TopKStringPatterns> valueList = new
LinkedList<TopKStringPatterns>();
StringOutputCollector<Text, TopKStringPatterns> collector = new
StringOutputCollector<Text, TopKStringPatterns>(
keyList, valueList);
StringOutputConverter soc = new
StringOutputConverter(collector);
StatusUpdater updater = new StatusUpdater() {
@Override
public void update(String status) {
System.out.println(status);
}
};
fp.generateTopKFrequentPatterns(transactions, generateFList,
minSupport, maxHeapSize, null, soc, updater);
writer.close();
fs.close();
System.out.println("list.size: " + valueList.size());
HashSet<List<String>> unique = new HashSet<List<String>>();
for (int i = 0; i < valueList.size(); i++) {
System.out.println(keyList.get(i) + " / " +
valueList.get(i));
Iterator<Pair<List<String>, Long>> iterator =
valueList.get(i).iterator();
while (iterator.hasNext()) {
unique.add(iterator.next().getFirst());
}
}
// print a bit more clearly
Iterator<List<String>> iterator = unique.iterator();
while (iterator.hasNext()) {
System.out.println(iterator.next());
}
} catch (IOException e) {
e.printStackTrace();
}
}
I wrote another collector class (I know, that was probably not necessary
but I
wanted to use the API a bit):
public class StringOutputCollector<K extends Writable, V extends Writable>
implements OutputCollector<K, V> {
private final List<K> keyList;
private final List<V> valueList;
public StringOutputCollector(List<K> keyList, List<V> valueList) {
this.keyList = keyList;
this.valueList = valueList;
}
@Override
public final void collect(K key, V value) throws IOException {
keyList.add(key);
valueList.add(value);
}
}
The output I get is the following:
1 / ([1, 3],3), ([1, 2, 3, 5],2)
5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2)
2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
But I think there's more frequent itemsets to find in these transactions
(with
min support 2):
{1} 3
{2} 4
{3} 4
{5} 4
{1, 2} 2
{1, 3} 3
{1, 5} 2
{2, 3} 3
{2, 5} 4
{3, 5} 3
{1, 2, 3} 2
{1, 2, 5} 3
{1, 3, 5} 2
{2, 3, 5} 3
{1, 2, 3, 5} 2
Also, some of them are duplicates with the mahout code. I'm not sure
whether
it's supposed to be like that.
I also tried another textbook example and that didn't produce the same
results
either.
So am I doing something wrong with the topk selection (that's what I maybe
could think of as a stupid mistake of mine)?
Any hints would be appreciated.
R