[
https://issues.apache.org/jira/browse/MAHOUT-157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12752249#action_12752249
]
Isabel Drost commented on MAHOUT-157:
-------------------------------------
The formatting still looks a bit weird (spaces, line length etc.)
PFPGrowth, line 101, 183, 214 - please at least add a warning log message prior
to deleting pre-existing output path and document somewhere on the usage page
that default behaviour is deleting the output path, if exists. (I think that
differs from the implementation in lda - we need to agree on consistant
behaviour across mahout in such cases).
line 174 - the combiner is commented out?
ParallelCountingMapper - shouldn't you report status through the reporter
during mapping?
ParallelFPGrowthMapper - line 87 - please do not use e.printStackTrace() but
generate a regular log message and log the exception stack trace through the
logger.
I would love to see some more comments: the expected format of key and value,
the expected content of glist and flist.
ParallelFPGrowthReducer line 111 - don't use e.printStackTrace.
AggregatorReducer - line 91 same
Attribute/TreeNode - The code is pretty clear, still I would love to see some
more documentation on the overall data structure.
FrequentPatternMaxHeap - line 74 - Huh? Judging from the return value, you can
omit the comparison against null here. (line 81 same.
FPGrowth - line 42 - the method name should not start with a capital letter.
517 lines for implementing the whole algorithm in one class - looks a bit large
for me. Is it possible to split it up?
line 165 - converting from Integer to int and back again usually costs quite a
bit of performance. Is there a way to rely on primitives only, or implement
your own incrementable integer type? Btw., Integer.valueOf(1) should be
replaced by Integer.ONE - that should be quicker and prevent in-accuracies.
Type T - I think it would make the code better readable if T were given a
clearer name, something like TransactionType? Otherwise you need to document
what exactly T represents.
line 229: Would reformulating the while conditions as
"while(!tempNode.childNodes.isEmpty()) { ... } make the code clearer here?
line 239: Where does the magic number 6 come from here? Define as a constant
with a speaking name?
the two generatedSinglePathPatterns methods look rather similar - is it
possible to not copy the code but extract it into its own method or reuse one
in the other?
line 293 (and earlier): Where does the magic number 4 come from? Define
constant with speaking name?
line 506: Looks like a strange log message?
Concerning your idea of going the algorithm interface way for fpGrowth: If you
can already make out what the interface should look like, I think that would be
a good way to make it easier for future implementors of other frequent itemset
algorithms.
> Frequent Pattern Mining using Parallel FP-Growth
> ------------------------------------------------
>
> Key: MAHOUT-157
> URL: https://issues.apache.org/jira/browse/MAHOUT-157
> Project: Mahout
> Issue Type: New Feature
> Affects Versions: 0.2
> Reporter: Robin Anil
> Fix For: 0.2
>
> Attachments: MAHOUT-157-August-17.patch, MAHOUT-157-August-24.patch,
> MAHOUT-157-August-31.patch, MAHOUT-157-August-6.patch,
> MAHOUT-157-Combinations-BSD-License.patch,
> MAHOUT-157-Combinations-BSD-License.patch,
> MAHOUT-157-inProgress-August-5.patch, MAHOUT-157-September-5.patch
>
>
> Implement: http://infolab.stanford.edu/~echang/recsys08-69.pdf
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.