[
https://issues.apache.org/jira/browse/MAHOUT-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14071882#comment-14071882
]
ASF GitHub Bot commented on MAHOUT-1500:
----------------------------------------
Github user cliffclick commented on the pull request:
https://github.com/apache/mahout/pull/21#issuecomment-49894450
This is a very basic port, focused on correctness and completeness, with no
effort spent on performance.
Expectation setting: there are easy 2x to 10x speedups in most of the
operator inner loops. The HDFS sequence-file readers/writers are
single-threaded and single-node; H2O's internal CSV reader will easily be 100x
faster.
Performance work should go in later commits.
Minor comments:
In many places, especially reduce() calls, the code could (and should) call
ArrayUtils.add(this,that) instead of looping over the arrays being added.
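
A minimal sketch of the pattern described above, in plain Java. The `add` helper here is a hypothetical local stand-in for an element-wise array-add utility, not H2O's actual ArrayUtils, and `ColSums` is an invented partial-result type used only for illustration.

```java
import java.util.Arrays;

// Sketch only: shows reduce() delegating to a shared element-wise add helper
// instead of repeating a hand-written loop in every reduce() body.
final class ReduceSketch {

    // Hypothetical helper (an assumption, not H2O's ArrayUtils):
    // adds b into a element-wise, in place, and returns a.
    static double[] add(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("length mismatch");
        }
        for (int i = 0; i < a.length; i++) {
            a[i] += b[i];
        }
        return a;
    }

    // Hypothetical per-task partial result, e.g. per-column sums.
    static final class ColSums {
        final double[] sums;

        ColSums(double[] sums) {
            this.sums = sums;
        }

        // Merge another partial result into this one via the helper.
        void reduce(ColSums that) {
            add(this.sums, that.sums);
        }
    }

    public static void main(String[] args) {
        ColSums left = new ColSums(new double[] {1.0, 2.0, 3.0});
        ColSums right = new ColSums(new double[] {10.0, 20.0, 30.0});
        left.reduce(right);
        System.out.println(Arrays.toString(left.sums)); // [11.0, 22.0, 33.0]
    }
}
```

Centralizing the loop in one helper keeps each reduce() to a single line and gives one place to optimize later.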
H2OHelper.empty_frame looks a lot like it should call "Vec.makeZero()" in a
loop instead of hand-rolling Vecs of zeros; there's a version which will take a
hand-rolled layout. This call should probably move into the Frame class directly.
The technique for row-labeling seems... awkward at best. Or at least I'm
reading that to be the purpose of using Tuple2. I think this design needs more
exploring, e.g. insert a row-label column in front of the "normal" Frame columns,
and teach the follow-on code to skip the first column. Note that many datasets have
non-numeric columns (e.g. name, address) that cannot participate in math ops, and
so most H2O algos already carry forward a notion of a set of columns being
worked on.
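
One way to picture the suggested layout, as a rough sketch in plain Java rather than H2O's Frame/Vec classes: the labels sit in column 0, and math code receives an explicit set of working column indices that leaves column 0 out. All names here (`LabeledFrameSketch`, `mathCols`, `colSums`) are hypothetical.

```java
import java.util.Arrays;

// Sketch only: a toy column-major "frame" whose first column holds row labels.
final class LabeledFrameSketch {

    // cols[0] is a String[] of row labels; every other entry is a double[].
    final Object[] cols;

    LabeledFrameSketch(String[] labels, double[]... numeric) {
        cols = new Object[numeric.length + 1];
        cols[0] = labels;
        for (int i = 0; i < numeric.length; i++) {
            cols[i + 1] = numeric[i];
        }
    }

    // Indices of the columns that take part in math ops: all except column 0.
    int[] mathCols() {
        int[] idx = new int[cols.length - 1];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i + 1;
        }
        return idx;
    }

    // Sums each of the given working columns; the label column is never touched.
    double[] colSums(int[] workCols) {
        double[] out = new double[workCols.length];
        for (int j = 0; j < workCols.length; j++) {
            double[] col = (double[]) cols[workCols[j]];
            for (double v : col) {
                out[j] += v;
            }
        }
        return out;
    }

    // Looks up the label stored in column 0 for a given row.
    String label(int row) {
        return ((String[]) cols[0])[row];
    }

    public static void main(String[] args) {
        LabeledFrameSketch f = new LabeledFrameSketch(
            new String[] {"r0", "r1"},
            new double[] {1.0, 2.0},
            new double[] {3.0, 4.0});
        System.out.println(f.label(1));                                // r1
        System.out.println(Arrays.toString(f.colSums(f.mathCols()))); // [3.0, 7.0]
    }
}
```

Passing the working-column set explicitly means the same code path also handles other non-numeric columns by simply leaving their indices out.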
Cliff
> H2O integration
> ---------------
>
> Key: MAHOUT-1500
> URL: https://issues.apache.org/jira/browse/MAHOUT-1500
> Project: Mahout
> Issue Type: Improvement
> Reporter: Anand Avati
> Fix For: 1.0
>
>
> Provide H2O backend for the Mahout DSL