[ 
https://issues.apache.org/jira/browse/IGNITE-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072913#comment-17072913
 ] 

Glenn Wiebe commented on IGNITE-12849:
--------------------------------------

Thanks Alexey.

Yes, I am ready to create a PR for these items. I would like the dev list to 
review for completeness, efficiency, etc.

-For the "String method for vectors" (as opposed to my String Coordinate-based 
feature method for BinaryObjectVectorizer) I am not sure what you are thinking 
here. Do keep in mind that MOST of the Vector object usage in Ignite ML is for 
the various (composite) Matrix functionality and their (underlying) math 
operations, and the focus of these structures and their operations is floating 
point arithmetic.-

-Now, there IS already a "LabeledVector" structure that extends the notion of a 
Vector when it is stored as a "Row" in a Dataset structure (i.e. a Dataset is a 
set of Vectors). A LabeledVector is an extension of DatasetRow, and this is 
where a small part of String functionality is added to a Vector: the 
LabeledVector version of a row can also attach a String label to the 
Float/Double-based Vector. In other words, a base Vector holds only 
floats/doubles, but when you store a vector in a dataset as a row, you can add 
a String label to that row.-

-Let me know what you were thinking if I misunderstood your comment about 
String method.-

I wrote the long-winded answer above, but then realized you probably meant 
"toString()" for Vector. In my simple/isolated implementation I created a new 
VerboseDenseVector object, but I am certainly on board with updating the base 
Vector object (e.g. AbstractVector, or maybe both the Dense and Sparse Vector 
objects if necessary) with something that is not too costly but can give 
clients useful information via the toString() method.
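
A minimal sketch of what a "not too costly" toString() could look like is 
below. It is a standalone illustration only: VectorPreview and preview() are 
hypothetical names, and the code works on a plain double[] rather than on 
Ignite's Vector API. The idea is to report the size and only a bounded number 
of leading values, so the cost does not grow with the vector length.

```java
/** Hypothetical helper (not part of Ignite): bounded, cheap preview of vector values. */
final class VectorPreview {
    private VectorPreview() {
        // Static helper only; no instances.
    }

    /**
     * Formats the vector size plus at most maxShown leading values.
     * The work done is O(maxShown), independent of values.length.
     */
    static String preview(double[] values, int maxShown) {
        // Clamp the number of printed values to [0, values.length].
        int shown = Math.min(values.length, Math.max(maxShown, 0));

        StringBuilder sb = new StringBuilder("Vector{size=")
            .append(values.length)
            .append(", values=[");

        for (int i = 0; i < shown; i++) {
            if (i > 0)
                sb.append(", ");
            sb.append(values[i]);
        }

        // Mark truncation so readers know more values exist.
        if (shown < values.length)
            sb.append(shown > 0 ? ", ..." : "...");

        return sb.append("]}").toString();
    }
}
```

For example, VectorPreview.preview(new double[] {1.0, 2.5, 3.0}, 2) yields 
"Vector{size=3, values=[1.0, 2.5, ...]}". A real implementation in 
AbstractVector would presumably iterate through the vector's own accessors 
instead of a raw array; that part is an assumption and is left out here.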

Let me know how to proceed with a PR, or whether you would first like to see 
the actual code I have in my test project.

> Add New BinaryObject Vectorizer for SparseVectors and Integer Coordinates
> -------------------------------------------------------------------------
>
>                 Key: IGNITE-12849
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12849
>             Project: Ignite
>          Issue Type: New Feature
>          Components: ml
>    Affects Versions: 2.8
>            Reporter: Glenn Wiebe
>            Assignee: Alexey Zinoviev
>            Priority: Minor
>             Fix For: 2.9
>
>
> A. DenseVector-based BinaryObjectVectorizer
> When using existing caches as a source of Datasets, the 
> BinaryObjectVectorizer is used.
> The existing BinaryObjectVectorizer only supports the creation of a 
> SparseVector.
> The LUDecomposition utility that supports Gaussian factorization for models 
> like GMM has a "Singularity indicator": a SparseVector and its null handling 
> will set a matrix column calculation to zero (0.0), which is below the 
> minimum check value (1e-11) and thus indicates that a matrix is not square. 
> This null handling of the SparseMatrix will restrict the use of some 
> algorithms like Gaussian Mixture Models where any Vector dimension that is 
> null will incorrectly signal that a matrix is not square.
> It would be great if we could:
> - Have a BinaryObjectVectorizer that uses a DenseMatrix to eliminate this 
> singularity trigger and enable use of GMM Trainer.
> B. CacheBasedDatasets not treated as Temporary Cache
> When using a cache-based dataset, the close() method destroys the Ignite 
> cache. This means that there is no ability to re-use the data loaded into 
> this dataset.
> It would be great if we could:
> - Not destroy the Ignite Cache holding the dataset on close (of one step in 
> an ML processing flow)
> - Allow for "attaching" to this prior, pre-calculated dataset in subsequent 
> use.
> C. Vector Visibility
> Vectors (unlike other value types, e.g. BinaryObjects) are not visible in 
> standard mechanisms, like the Ignite Web Console, where the toString() method 
> does not present any information about the embedded vector values.
> It would be great if we could:
> - have a Vector.toString() method implementation that presented some 
> information about what is actually in the Vector.
> I have implemented the above items and have used them at a customer where I 
> needed these capabilities (or at least it dramatically reduced the cost and 
> increased the value of the solution).
> It would be great if the community was supportive of this 
> expansion/improvement of the Ignite ML library.
> Thanks,
>   Glenn



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
