Github user nilmeier commented on the pull request:
https://github.com/apache/spark/pull/8563#issuecomment-145419186
I addressed Fred's review comments in the most recent update. I was not able
to fully address every issue raised; details are below:
Review comments:
1) --> Try making the portion of the examples with input data more condensed,
perhaps by reading the matrices from string constants
X Done.
2) --> There's an orphan/duplicate ScalaDoc comment at line 356 of BlockMatrix.scala
X Done.
3) --> Remove carriage returns on lines 370, 481, 569, 732
X Done.
4) --> Recommend adding a call to the new subtract method to the MLlib
programmer's guide
X Done.
5) --> New API calls to BlockMatrix should have corresponding PySpark APIs
O Mike Dusenberry has an open JIRA for this. I will be meeting with him,
since he has done Python API work before.
6) --> Error message at line 394 should print out the block sizes that don't match
X Added error messages to .add and .subtract.
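For reference, the check is roughly along the lines of the sketch below; `rowsPerBlock` and `colsPerBlock` are the existing BlockMatrix members, `other` is the second operand, and the exact message wording may differ from what was committed:

```scala
// Sketch of block-size validation in add/subtract.
require(rowsPerBlock == other.rowsPerBlock && colsPerBlock == other.colsPerBlock,
  s"Cannot add/subtract matrices with different block dimensions: " +
    s"($rowsPerBlock x $colsPerBlock) vs " +
    s"(${other.rowsPerBlock} x ${other.colsPerBlock})")
```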
7) --> The code at line 384 should multiply every element of b.head by -1,
as far as I can see
X Fixed this error.
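The fix follows from subtract being add with the right-hand operand negated: every element of every block of b must be multiplied by -1, not only the first block. A minimal sketch using Breeze, the matrix library backing the local blocks:

```scala
import breeze.linalg.DenseMatrix

// Negate every element of a local block; subtract(a, b) == add(a, b negated blockwise).
val block = DenseMatrix((1.0, 2.0), (3.0, 4.0))
val negated = block.map(_ * -1.0)  // DenseMatrix((-1.0, -2.0), (-3.0, -4.0))
```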
8) --> Line 456 and 465-471 have wrong indentation
X Done.
9) --> Scaladoc at 474 should state that blockRowRange and blockColRange are
block indexes, not row/column indexes
X Done.
10) --> In lines 460-463, consider making a single pass over the blocks
instead of 4 passes
O I couldn't see an easy way to do this without introducing additional layers
of recursion, which could impact performance. I would like to address this in
a later revision, since the recursive building procedures are a general
design issue to consider here.
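For the record, a single-pass split over the block index is straightforward in isolation; the sketch below (plain Scala, hypothetical block values) partitions into four quadrants in one traversal. Wiring this into the recursive build without extra recursion depth is the part that would need the larger rewrite:

```scala
// One pass over the blocks instead of four filtered passes.
val blocks = Seq(((0, 0), "A11"), ((0, 1), "A12"), ((1, 0), "A21"), ((1, 1), "A22"))
val split = 1  // block rows/cols below `split` belong to the upper-left quadrant
val quadrants = blocks.groupBy { case ((i, j), _) => (i < split, j < split) }
// quadrants((true, true))   -> upper-left blocks
// quadrants((false, false)) -> lower-right blocks
```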
11) --> Add a note to SchurComplement that the current implementation assumes
that a11 fits into memory on the driver
X Done.
12) --> Might want to use a case class in the return type of blockLUtoSolver
O I will try to do this in the near future, but I didn't include the changes
in this revision.
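The change would be along these lines; the type and field names below are hypothetical, chosen only to illustrate the shape of a named return type replacing a bare tuple:

```scala
// Hypothetical named result type for blockLUtoSolver.
case class BlockLUSolution(
    permutation: Array[Int],  // row permutation from pivoting
    lower: BlockMatrix,       // L factor as a BlockMatrix
    upper: BlockMatrix)       // U factor as a BlockMatrix
```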
13) --> Take a close look at the performance impact of the chains of
multiplications at line 811 when there are many levels of recursion
O We did some studies, and the code would benefit from refactoring to remove
recursion. It would require a significant rewrite, and might be beyond the
scope of the initial submission.
14) --> In recursiveSequencesBuild, you may want to break off more than one
block from the upper left corner of the matrix; in many cases, the available
memory on the driver can comfortably hold, say, 10x10 blocks. You should be
able to query the SparkContext's memory information to determine how much
heap space is available for the in-memory portion of the computation. On a
side note, it would be good to produce a user-friendly error if it looks like
there is not enough local heap to process one block's data locally.
O This may require a significant rewrite to handle correctly; I would like to
try it in the next revision.
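A lightweight first step, independent of the larger rewrite, could be a driver-side heap check via the JVM Runtime API; in this sketch the `rowsPerBlock`/`colsPerBlock` names and the safety factor of 2 are illustrative assumptions:

```scala
// Rough user-friendly error when one block likely won't fit in local heap.
val rt = Runtime.getRuntime
val freeHeap = rt.maxMemory - (rt.totalMemory - rt.freeMemory)
val blockBytes = 8L * rowsPerBlock * colsPerBlock  // doubles in one dense block
require(freeHeap > 2 * blockBytes,
  s"Not enough driver heap ($freeHeap bytes free) to process one " +
    s"$rowsPerBlock x $colsPerBlock block ($blockBytes bytes) locally.")
```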
15) --> Might want to reuse buffers for the local data allocated at lines
623-629 to avoid triggering global GC at each level of recursion.
O I would like to explore this in the future as well, but I didn't address it
in this update.