Hi. I was trying to test the fpgrowth algorithm on a toy example and I kind of fail to get the correct results ... so it'd be nice if someone could take a look at my (admittedly very ad-hoc) code and tell me whether there's something wrong with it. I'm running the mahout svn version.
I tried to run the following example (input): 1 3 4 2 3 5 1 2 3 5 2 5 1 2 3 5 With the following code: public static void main(String[] args) { int minSupport = 2; int maxHeapSize = 100; String input = "inputfile"; String output = "output"; FPGrowth<String> fp = new FPGrowth<String>(); FileSystem fs = new RawLocalFileSystem(); Configuration conf = new Configuration(); String pattern = "[\\ ]"; try { fs = FileSystem.get(URI.create(output), conf); SequenceFile.Writer writer = null; writer = new SequenceFile.Writer(fs, conf, new Path(output), Text.class, TopKStringPatterns.class); Charset encoding = Charset.forName("UTF-8"); List<Pair<String, Long>> generateFList = null; generateFList = fp.generateFList(new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern), minSupport); StringRecordIterator transactions = null; transactions = new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern); List<Text> keyList = new LinkedList<Text>(); List<TopKStringPatterns> valueList = new LinkedList<TopKStringPatterns>(); StringOutputCollector<Text, TopKStringPatterns> collector = new StringOutputCollector<Text, TopKStringPatterns>( keyList, valueList); StringOutputConverter soc = new StringOutputConverter(collector); StatusUpdater updater = new StatusUpdater() { @Override public void update(String status) { System.out.println(status); } }; fp.generateTopKFrequentPatterns(transactions, generateFList, minSupport, maxHeapSize, null, soc, updater); writer.close(); fs.close(); System.out.println("list.size: " + valueList.size()); HashSet<List<String>> unique = new HashSet<List<String>>(); for (int i = 0; i < valueList.size(); i++) { System.out.println(keyList.get(i) + " / " + valueList.get(i)); Iterator<Pair<List<String>, Long>> iterator = valueList.get(i).iterator(); while (iterator.hasNext()) { unique.add(iterator.next().getFirst()); } } // print a bit more clearly Iterator<List<String>> iterator = unique.iterator(); while (iterator.hasNext()) { System.out.println(iterator.next()); } } catch (IOException e) { e.printStackTrace(); } } I wrote another collector class (I know, that was probably not necessary but I wanted to use the API a bit): public class StringOutputCollector<K extends Writable, V extends Writable> implements OutputCollector<K, V> { private final List<K> keyList; private final List<V> valueList; public StringOutputCollector(List<K> keyList, List<V> valueList) { this.keyList = keyList; this.valueList = valueList; } @Override public final void collect(K key, V value) throws IOException { keyList.add(key); valueList.add(value); } } The output I get is the following: 1 / ([1, 3],3), ([1, 2, 3, 5],2) 5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) 3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2) 2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) But I think there's more frequent itemsets to find in these transactions (with min support 2): {1} 3 {2} 4 {3} 4 {5} 4 {1, 2} 2 {1, 3} 3 {1, 5} 2 {2, 3} 3 {2, 5} 4 {3, 5} 3 {1, 2, 3} 2 {1, 2, 5} 3 {1, 3, 5} 2 {2, 3, 5} 3 {1, 2, 3, 5} 2 Also, some of them are duplicates with the mahout code. I'm not sure whether it's supposed to be like that. I also tried another textbook example and that didn't produce the same results either. So am I doing something wrong with the topk selection (that's what I maybe could think of as a stupid mistake of mine)? Any hints would be appreciated. R