[ https://issues.apache.org/jira/browse/IGNITE-12849?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17072913#comment-17072913 ]
Glenn Wiebe commented on IGNITE-12849:
--------------------------------------

Thanks Alexey.

Yes, I am ready to create a PR for these items. I would like the dev list to review them for completeness, efficiency, etc.

-For the "String method for vectors" (as opposed to my String Coordinate-based feature method for BinaryObjectVectorizer) I am not sure what you are thinking here. Do keep in mind that MOST of the Vector object usage in Ignite ML is for the various (composite) Matrix functionality and their (underlying) math operations, and the focus of these structures and their operations is floating point arithmetic.-

-Now, there IS already a "LabeledVector" structure that extends the notion of a Vector when it is stored as a "Row" in a Dataset structure (i.e. a Dataset is a set of Vectors). So a LabeledVector is an extension of the DatasetRow, and this is where a small part of String functionality is added to a Vector; i.e. a Row: a LabeledVector version of a row can also add a String Label to the Float/Double-based Vector. In other words, when dealing with a base Vector you only have floats/doubles, but when you store a vector in a set, as a row of the dataset, you can add a String Label to the row.-

-Let me know what you were thinking if I misunderstood your comment about String method.-

I wrote the long-winded answer above, but then realized you probably meant "toString()" for Vector. In my simple/isolated implementation I created a new VerboseDenseVector object, but I am certainly on board with updating the base Vector object (e.g. AbstractVector, or maybe both the Dense and Sparse Vector objects if necessary) with something that is not too costly but can give clients useful information via the toString() method.

Let me know how to proceed with a PR, or with the actual code I have for my test project.
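To make the "not too costly" toString() idea concrete, here is a minimal sketch of one possible shape. It is an assumption for illustration only: the class name VerboseVectorSketch, the MAX_SHOWN cap, and the output format are hypothetical and are not taken from Ignite's Vector, AbstractVector, or the VerboseDenseVector mentioned above. The cap keeps the call cheap for large vectors by printing only the leading elements.

```java
import java.util.StringJoiner;

/**
 * Hypothetical stand-in for a dense vector. This is NOT Ignite's DenseVector
 * or the VerboseDenseVector from the comment; it only illustrates a bounded toString().
 */
public final class VerboseVectorSketch {
    /** Assumed upper bound on printed elements, so toString() stays cheap for large vectors. */
    private static final int MAX_SHOWN = 8;

    /** Backing values, copied defensively in the constructor. */
    private final double[] data;

    public VerboseVectorSketch(double... data) {
        this.data = data.clone();
    }

    @Override
    public String toString() {
        StringJoiner joiner = new StringJoiner(", ", "[", "]");
        int shown = Math.min(data.length, MAX_SHOWN);

        // Print at most MAX_SHOWN leading elements.
        for (int i = 0; i < shown; i++)
            joiner.add(Double.toString(data[i]));

        // Mark truncation explicitly so readers know more values exist.
        if (data.length > shown)
            joiner.add("...");

        return "VerboseVectorSketch [size=" + data.length + ", values=" + joiner + "]";
    }
}
```

With this sketch, `new VerboseVectorSketch(1.0, 2.5, 0.0).toString()` yields `VerboseVectorSketch [size=3, values=[1.0, 2.5, 0.0]]`. A sparse variant would presumably print index/value pairs for the non-default entries instead, which is a separate design decision.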
> Add New BinaryObject Vectorizer for SparseVectors and Integer Coordinates
> -------------------------------------------------------------------------
>
>                 Key: IGNITE-12849
>                 URL: https://issues.apache.org/jira/browse/IGNITE-12849
>             Project: Ignite
>          Issue Type: New Feature
>          Components: ml
>    Affects Versions: 2.8
>            Reporter: Glenn Wiebe
>            Assignee: Alexey Zinoviev
>            Priority: Minor
>             Fix For: 2.9
>
>
> A. DenseVector-based BinaryObjectVectorizer
> When using existing caches as a source of Datasets, the
> BinaryObjectVectorizer is used.
> The existing BinaryObjectVectorizer only supports the creation of a
> SparseVector.
> The LUDecomposition utility that supports Gaussian factorization for models
> like GMM has a "Singularity indicator" for which a SparseVector and its null
> handling will set a matrix column calculation to be zero/0.0, which is below
> the minimum check value (1e-11) and thus indicates a matrix is not square.
> This null handling of the SparseMatrix will restrict the use of some
> algorithms like Gaussian Mixture Models, where any Vector dimension that is
> null will incorrectly signal that a matrix is not square.
> It would be great if we could:
> - Have a BinaryObjectVectorizer that uses a DenseMatrix to eliminate this
> singularity trigger and enable use of GMM Trainer.
>
> B. CacheBasedDatasets not treated as Temporary Cache
> When using a cache-based dataset, the close() method destroys the Ignite
> cache. This means that there is no ability to re-use the data loaded into
> this dataset.
> It would be great if we could:
> - Not destroy the Ignite Cache holding the dataset on close (of one step in
> an ML processing flow)
> - Allow for "attaching" to this prior, pre-calculated dataset in subsequent
> use.
>
> C. Vector Visibility
> Vectors (unlike other value types, e.g. BinaryObjects) are not visible in
> standard mechanisms, like the Ignite Web Console, where the toString() method
> does not present any information about the embedded vector values.
> It would be great if we could:
> - Have a Vector.toString() method implementation that presents some
> information about what is actually in the Vector.
>
> I have implemented the above items and have used them at a customer where I
> needed these capabilities (or at least it dramatically reduced the cost and
> increased the value of the solution).
> It would be great if the community was supportive of this
> expansion/improvement of the Ignite ML library.
> Thanks,
> Glenn

--
This message was sent by Atlassian Jira
(v8.3.4#803005)