I have to clarify this as many have asked this before. Mahout's
Implementation is  Top K FPGrowth that finds closed patterns

Ok, I didn't know that, thanks for the clarification.

There's a paper explaining exactly this (is it this one you used for
implementing maybe?):

http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.37.1102

R

On Thu, 15 Apr 2010, Robin Anil wrote:


if {1} - 3 is a pattern {1, 2}- 3 is another top pattern  of item 1

{1, 2} - 3 is a closed pattern of {1} -3 as there is no information that the
former gives that that the latter gives extra, this gives significant boost
in speed for the Mahout FP-Growth algorithm. If the closed pattern is not
there, then there should be a problem with implementation otherwise its an
algorithm choice. I am forgetting the paper form which I implemented this. I
will post that when I find it.

Robin


On Thu, Apr 15, 2010 at 4:03 AM, Robert Neumayer <neuma...@idi.ntnu.no>wrote:

Hi.

I was trying to test the fpgrowth algorithm on a toy example and I kind of
fail
to get the correct results ... so it'd be nice if someone could take a look
at
my (admittedly very ad-hoc) code and tell me whether there's something
wrong with it. I'm running the mahout svn version.

I tried to run the following example (input):

1 3 4
2 3 5
1 2 3 5
2 5
1 2 3 5

With the following code:
 public static void main(String[] args) {
       int minSupport = 2;
       int maxHeapSize = 100;

       String input = "inputfile";
       String output = "output";

       FPGrowth<String> fp = new FPGrowth<String>();
       FileSystem fs = new RawLocalFileSystem();
       Configuration conf = new Configuration();

       String pattern = "[\\ ]";
       try {
           fs = FileSystem.get(URI.create(output), conf);

           SequenceFile.Writer writer = null;
           writer = new SequenceFile.Writer(fs, conf, new Path(output),
Text.class, TopKStringPatterns.class);

           Charset encoding = Charset.forName("UTF-8");

           List<Pair<String, Long>> generateFList = null;
           generateFList = fp.generateFList(new StringRecordIterator(new
FileLineIterable(new File(input), encoding, false),
                   pattern), minSupport);

           StringRecordIterator transactions = null;

           transactions = new StringRecordIterator(new FileLineIterable(new
File(input), encoding, false), pattern);


           List<Text> keyList = new LinkedList<Text>();
           List<TopKStringPatterns> valueList = new
LinkedList<TopKStringPatterns>();

           StringOutputCollector<Text, TopKStringPatterns> collector = new
StringOutputCollector<Text, TopKStringPatterns>(
                   keyList, valueList);
           StringOutputConverter soc = new
StringOutputConverter(collector);

           StatusUpdater updater = new StatusUpdater() {
               @Override
               public void update(String status) {
                   System.out.println(status);
               }
           };

           fp.generateTopKFrequentPatterns(transactions, generateFList,
minSupport, maxHeapSize, null, soc, updater);
           writer.close();
           fs.close();

           System.out.println("list.size: " + valueList.size());
           HashSet<List<String>> unique = new HashSet<List<String>>();
           for (int i = 0; i < valueList.size(); i++) {

               System.out.println(keyList.get(i) + " / " +
valueList.get(i));
               Iterator<Pair<List<String>, Long>> iterator =
valueList.get(i).iterator();
               while (iterator.hasNext()) {

                   unique.add(iterator.next().getFirst());
               }
           }
           // print a bit more clearly
           Iterator<List<String>> iterator = unique.iterator();
           while (iterator.hasNext()) {
               System.out.println(iterator.next());
           }

       } catch (IOException e) {
           e.printStackTrace();
       }
   }

I wrote another collector class (I know, that was probably not necessary
but I
wanted to use the API a bit):

public class StringOutputCollector<K extends Writable, V extends Writable>
implements OutputCollector<K, V> {

   private final List<K> keyList;


   private final List<V> valueList;

   public StringOutputCollector(List<K> keyList, List<V> valueList) {
       this.keyList = keyList;
       this.valueList = valueList;
   }

   @Override
   public final void collect(K key, V value) throws IOException {
       keyList.add(key);
       valueList.add(value);
   }

}

The output I get is the following:
1 / ([1, 3],3), ([1, 2, 3, 5],2)
5 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)
3 / ([3],4), ([2, 3, 5],3), ([1, 3],3), ([1, 2, 3, 5],2)
2 / ([2, 5],4), ([2, 3, 5],3), ([1, 2, 3, 5],2)

But I think there's more frequent itemsets to find in these transactions
(with
min support 2):

{1}     3
{2}     4
{3}     4
{5}     4
{1, 2}  2
{1, 3}  3
{1, 5}  2
{2, 3}  3
{2, 5}  4
{3, 5}  3
{1, 2, 3}       2
{1, 2, 5}       3
{1, 3, 5}       2
{2, 3, 5}       3
{1, 2, 3, 5}    2

Also, some of them are duplicates with the mahout code. I'm not sure
whether
it's supposed to be like that.

I also tried another textbook example and that didn't produce the same
results
either.

So am I doing something wrong with the topk selection (that's what I maybe
could think of as a stupid mistake of mine)?

Any hints would be appreciated.

R




--
Robert Neumayer
Norwegian University of Science and Technology (NTNU)
Department of Computer and Information Science
Sem Sælands vei 7-9
NO-7491 Trondheim
http://www.idi.ntnu.no/~neumayer

Reply via email to