I have to clarify this as many have asked this before. Mahout's Implementation is Top K FPGrowth that finds closed patterns
if {1} - 3 is a pattern {1, 2}- 3 is another top pattern of item 1 {1, 2} - 3 is a closed pattern of {1} -3 as there is no information that the former gives that that the latter gives extra, this gives significant boost in speed for the Mahout FP-Growth algorithm. If the closed pattern is not there, then there should be a problem with implementation otherwise its an algorithm choice. I am forgetting the paper form which I implemented this. I will post that when I find it. Robin On Thu, Apr 15, 2010 at 4:03 AM, Robert Neumayer <neuma...@idi.ntnu.no>wrote: > Hi. > > I was trying to test the fpgrowth algorithm on a toy example and I kind of > fail > to get the correct results ... so it'd be nice if someone could take a look > at > my (admittedly very ad-hoc) code and tell me whether there's something > wrong with it. I'm running the mahout svn version. > > I tried to run the following example (input): > > 1 3 4 > 2 3 5 > 1 2 3 5 > 2 5 > 1 2 3 5 > > With the following code: > public static void main(String[] args) { > int minSupport = 2; > int maxHeapSize = 100; > > String input = "inputfile"; > String output = "output"; > > FPGrowth<String> fp = new FPGrowth<String>(); > FileSystem fs = new RawLocalFileSystem(); > Configuration conf = new Configuration(); > > String pattern = "[\\ ]"; > try { > fs = FileSystem.get(URI.create(output), conf); > > SequenceFile.Writer writer = null; > writer = new SequenceFile.Writer(fs, conf, new Path(output), > Text.class, TopKStringPatterns.class); > > Charset encoding = Charset.forName("UTF-8"); > > List<Pair<String, Long>> generateFList = null; > generateFList = fp.generateFList(new StringRecordIterator(new > FileLineIterable(new File(input), encoding, false), > pattern), minSupport); > > StringRecordIterator transactions = null; > > transactions = new StringRecordIterator(new FileLineIterable(new > File(input), encoding, false), pattern); > > > List<Text> keyList = new LinkedList<Text>(); > List<TopKStringPatterns> valueList = new > LinkedList<TopKStringPatterns>(); > > StringOutputCollector<Text, TopKStringPatterns> collector = new > StringOutputCollector<Text, TopKStringPatterns>( > keyList, valueList); > StringOutputConverter soc = new > StringOutputConverter(collector); > > StatusUpdater updater = new StatusUpdater() { > @Override > public void update(String status) { > System.out.println(status); > } > }; > > fp.generateTopKFrequentPatterns(transactions, generateFList, > minSupport, maxHeapSize, null, soc, updater); > writer.close(); > fs.close(); > > System.out.println("list.size: " + valueList.size()); > HashSet<List<String>> unique = new HashSet<List<String>>(); > for (int i = 0; i < valueList.size(); i++) { > > System.out.println(keyList.get(i) + " / " + > valueList.get(i)); > Iterator<Pair<List<String>, Long>> iterator = > valueList.get(i).iterator(); > while (iterator.hasNext()) { > > unique.add(iterator.next().getFirst()); > } > } > // print a bit more clearly > Iterator<List<String>> iterator = unique.iterator(); > while (iterator.hasNext()) { > System.out.println(iterator.next()); > } > > } catch (IOException e) { > e.printStackTrace(); > } > } > > I wrote another collector class (I know, that was probably not necessary > but I > wanted to use the API a bit): > > public class StringOutputCollector<K extends Writable, V extends Writable> > implements OutputCollector<K, V> { > > private final List<K> keyList; > > > private final List<V> valueList; > > public StringOutputCollector(List<K> keyList, List<V> valueList) { > this.keyList = keyList; > this.valueList = valueList; > } > > @Override > public final void collect(K key, V value) throws IOException { > keyList.add(key); > valueList.add(value); > } > > } > > The output I get is the following: > 1 / ([1, 3],3), ([1, 2, 3, 5],2) > 5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) > 3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2) > 2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2) > > But I think there's more frequent itemsets to find in these transactions > (with > min support 2): > > {1} 3 > {2} 4 > {3} 4 > {5} 4 > {1, 2} 2 > {1, 3} 3 > {1, 5} 2 > {2, 3} 3 > {2, 5} 4 > {3, 5} 3 > {1, 2, 3} 2 > {1, 2, 5} 3 > {1, 3, 5} 2 > {2, 3, 5} 3 > {1, 2, 3, 5} 2 > > Also, some of them are duplicates with the mahout code. I'm not sure > whether > it's supposed to be like that. > > I also tried another textbook example and that didn't produce the same > results > either. > > So am I doing something wrong with the topk selection (that's what I maybe > could think of as a stupid mistake of mine)? > > Any hints would be appreciated. > > R >