pitrou commented on a change in pull request #7477:
URL: https://github.com/apache/arrow/pull/7477#discussion_r443707259
##########
File path: python/pyarrow/tensor.pxi
##########
@@ -339,6 +350,15 @@ shape: {0.shape}""".format(self)
def non_zero_length(self):
return self.stp.non_zero_length()
+ @property
+ def has_canonical_format(self):
+ cdef:
+ _CSparseCOOIndexPtr csi
+
+ csi = dynamic_cast[_CSparseCOOIndexPtr](self.stp.sparse_index().get())
Review comment:
Why the `dynamic_cast`? Since this is a sparse COO tensor, the cast
should always succeed. Just use a normal Cython cast.
##########
File path: format/SparseTensor.fbs
##########
@@ -63,6 +63,9 @@ table SparseTensorIndexCOO {
/// The location and size of the indices matrix's data
indicesBuffer: Buffer (required);
+
+ /// The canonicality flag
Review comment:
What is this? Can you explain a bit more?
##########
File path: python/pyarrow/tensor.pxi
##########
@@ -270,8 +279,10 @@ shape: {0.shape}""".format(self)
&out_data, &out_coords))
data = PyObject_to_object(out_data)
coords = PyObject_to_object(out_coords)
- result = coo_matrix((data[:, 0], (coords[:, 0], coords[:, 1])),
- shape=self.shape)
+ row, col = coords[:, 0], coords[:, 1]
+ result = coo_matrix((data[:, 0], (row, col)), shape=self.shape)
+ if self.has_canonical_format:
+ result.sum_duplicates()
Review comment:
Why?
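For context, a minimal standalone scipy sketch (not the PR's code) of what `sum_duplicates()` does to a COO matrix: it merges duplicate entries and sorts the indices into row-major order, after which scipy marks the matrix as canonical. This assumes the `has_canonical_format` attribute on `coo_matrix` discussed further down in this thread, which may be undocumented depending on the scipy version.
```python
import numpy as np
from scipy.sparse import coo_matrix

# Two entries at (0, 0) make this COO matrix non-canonical.
data = np.array([1.0, 2.0, 5.0])
row = np.array([0, 0, 1])
col = np.array([0, 0, 2])
m = coo_matrix((data, (row, col)), shape=(2, 3))

m.sum_duplicates()             # merges duplicates and sorts indices in place
print(m.data)                  # [3. 5.] -- the two (0, 0) entries were summed
print(m.has_canonical_format)  # True once sum_duplicates() has run
```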
##########
File path: cpp/src/arrow/sparse_tensor_test.cc
##########
@@ -49,7 +49,10 @@ static inline void AssertCOOIndex(const std::shared_ptr<Tensor>& sidx, const int
}
}
-TEST(TestSparseCOOIndex, Make) {
+//-----------------------------------------------------------------------------
Review comment:
Perhaps there should also be some tests with empty/all-zero tensors?
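The review is about the C++ tests, but as a rough Python-side sketch of the kind of case being suggested (this assumes pyarrow's `SparseCOOTensor.from_dense_numpy` constructor and `to_tensor()` conversion, plus the `non_zero_length` property shown elsewhere in this PR):
```python
import numpy as np
import pyarrow as pa

# An all-zero dense array should convert to a sparse COO tensor
# with no stored entries, and still round-trip cleanly.
dense = np.zeros((3, 4))
sparse = pa.SparseCOOTensor.from_dense_numpy(dense)

assert sparse.non_zero_length == 0
assert sparse.shape == (3, 4)
assert np.array_equal(sparse.to_tensor().to_numpy(), dense)
```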
##########
File path: python/pyarrow/tensor.pxi
##########
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
for x in dim_names:
c_dim_names.push_back(tobytes(x))
- coords = np.vstack([obj.row, obj.col]).T
+ row = obj.row
+ col = obj.col
+ if obj.has_canonical_format:
+ order = np.lexsort((col, row)) # row-major order
Review comment:
I don't understand this. If Scipy says the indices are sorted, why do
you need to sort them again?
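As a standalone numpy illustration of the point (no pyarrow involved): `np.lexsort((col, row))` sorts by row first and by col within each row, i.e. row-major order, so for indices that scipy already reports as canonical it should just return the identity permutation.
```python
import numpy as np

# Indices already in canonical (row-major, duplicate-free) order.
row = np.array([0, 0, 1, 2])
col = np.array([1, 3, 0, 2])

# lexsort uses the *last* key as the primary key, so (col, row)
# means "sort by row, then by col" -- row-major order.
order = np.lexsort((col, row))
print(order)  # [0 1 2 3]: canonical input comes back unchanged
```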
##########
File path: cpp/src/arrow/sparse_tensor.cc
##########
@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const std::shared_ptr<DataType>& type,
return Status::OK();
}
+void get_coo_index_tensor_row(const std::shared_ptr<Tensor>& coords, const int64_t row,
+ std::vector<int64_t>* out_index) {
+ const auto& fw_index_value_type =
+ internal::checked_cast<const FixedWidthType&>(*coords->type());
+ const size_t indices_elsize = fw_index_value_type.bit_width() / CHAR_BIT;
+
+ const auto& shape = coords->shape();
+ const int64_t non_zero_length = shape[0];
+ DCHECK(0 <= row && row < non_zero_length);
+
+ const int64_t ndim = shape[1];
+ std::vector<int64_t> index;
+ index.reserve(ndim);
Review comment:
You can write directly into `out_index`:
```c++
out_index->resize(ndim);
// ...
for (int64_t i = 0; i < ndim; ++i) {
  (*out_index)[i] = static_cast<int64_t>(coords->Value<UInt8Type>({row, i}));
}
```
##########
File path: python/pyarrow/tensor.pxi
##########
@@ -199,7 +202,13 @@ shape: {0.shape}""".format(self)
for x in dim_names:
c_dim_names.push_back(tobytes(x))
- coords = np.vstack([obj.row, obj.col]).T
+ row = obj.row
+ col = obj.col
+ if obj.has_canonical_format:
Review comment:
Does this attribute exist? I don't see it in
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html#scipy.sparse.coo_matrix
In
https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html#scipy.sparse.csr_matrix,
I see an attribute named `has_sorted_indices`
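A quick, standalone way to check locally; the attribute may simply be missing from that documentation page, and whether it exists can depend on the installed scipy version:
```python
import numpy as np
from scipy.sparse import coo_matrix, csr_matrix

coo = coo_matrix(np.eye(3))
csr = csr_matrix(np.eye(3))

# Does coo_matrix expose the (possibly undocumented) flag?
print(hasattr(coo, "has_canonical_format"))
# csr_matrix documents has_sorted_indices, as noted above.
print(hasattr(csr, "has_sorted_indices"))
```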
##########
File path: cpp/src/arrow/sparse_tensor.cc
##########
@@ -298,16 +298,137 @@ inline Status CheckSparseCOOIndexValidity(const std::shared_ptr<DataType>& type,
return Status::OK();
}
+void get_coo_index_tensor_row(const std::shared_ptr<Tensor>& coords, const int64_t row,
Review comment:
Can you follow the style guide? Use CamelCase for function names.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]