sahnib commented on code in PR #44961: URL: https://github.com/apache/spark/pull/44961#discussion_r1489749903
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/ListStateImpl.scala:
##########
@@ -0,0 +1,118 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql.execution.streaming
+
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder
+import org.apache.spark.sql.execution.streaming.state.{StateStore, StateStoreErrors}
+import org.apache.spark.sql.streaming.ListState
+
+/**
+ * Provides concrete implementation for list of values associated with a state variable
+ * used in the streaming transformWithState operator.
+ *
+ * @param store - reference to the StateStore instance to be used for storing state
+ * @param stateName - name of logical state partition
+ * @tparam S - data type of object that will be stored in the list
+ */
+class ListStateImpl[S](store: StateStore,
+    stateName: String,
+    keyExprEnc: ExpressionEncoder[Any])
+  extends ListState[S] with Logging {
+
+  /** Whether state exists or not. */
+  override def exists(): Boolean = {
+    val encodedGroupingKey = StateTypesEncoderUtils.encodeGroupingKey(stateName, keyExprEnc)
+    val stateValue = store.get(encodedGroupingKey, stateName)
+    stateValue != null
+  }
+
+  /** Get the state value if it exists. If the state does not exist in state store, an
+   * empty iterator is returned. */
+  override def get(): Iterator[S] = {
+    val encodedKey = StateTypesEncoderUtils.encodeGroupingKey(stateName, keyExprEnc)
+    val unsafeRowValuesIterator = store.valuesIterator(encodedKey, stateName)
+    new Iterator[S] {
+      override def hasNext: Boolean = {
+        unsafeRowValuesIterator.hasNext
+      }
+
+      override def next(): S = {
+        val valueUnsafeRow = unsafeRowValuesIterator.next()
+        StateTypesEncoderUtils.decodeValue(valueUnsafeRow)
+      }
+    }
+  }
+
+  /** Get the list value as an option if it exists and None otherwise. */
+  override def getOption(): Option[Iterator[S]] = {
+    Option(get())
+  }
+
+  /** Update the value of the list. */
+  override def put(newState: Array[S]): Unit = {
+    validateNewState(newState)
+
+    if (newState.isEmpty) {
+      this.clear()

Review Comment:
Thanks for pointing this out. With the current implementation, we cannot distinguish an empty list from a null value.

In RocksDB, we store each element of the list (encoded via MultipleValuesEncoder), and RocksDB merges them using `StringAppendOperator`. The merge happens asynchronously at compaction, and it's possible that different elements of the list are present in different SST files. As we store each element separately (by calling `merge`), we cannot distinguish an empty list from a null value.

Having said that, the multi-valued encoder encodes the value like `|---size(bytes)--|--unsafeRowEncodedBytes--|`. Thinking out loud, we could technically create a special value which has `size(bytes) = 0` to represent an empty list. When merging values on read, we can discard this value if any actual values with non-zero size are present.
If no values with non-zero size are present, we can return an empty list.

As of now, I think we should remove the `getOption` method from ListState, and return an empty Iterator from `get()` if there are no merged values in RocksDB. We can support empty lists in the future, and take on the complexity described in the paragraph above, if customers want this support.

cc: @anishshri-db Thoughts?
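
To make the zero-size marker idea above a bit more concrete, here is a rough sketch of how it might work. This is not Spark's actual encoder: `EmptyListMarkerSketch`, `EmptyListMarker`, `encodeElement`, `merge` and `decode` are hypothetical names, payloads are plain byte arrays rather than UnsafeRows, and plain concatenation stands in for whatever RocksDB's merge produces (any delimiter handling is ignored).

```scala
import java.nio.ByteBuffer
import scala.collection.mutable.ArrayBuffer

// Hypothetical sketch only; none of these names exist in Spark.
object EmptyListMarkerSketch {
  // Assumed marker: a record whose 4-byte size prefix is 0 and has no payload.
  val EmptyListMarker: Array[Byte] = ByteBuffer.allocate(4).putInt(0).array()

  // Encode one element as |size (4 bytes)|payload bytes|.
  def encodeElement(payload: Array[Byte]): Array[Byte] = {
    require(payload.nonEmpty, "size 0 is reserved for the empty-list marker")
    ByteBuffer.allocate(4 + payload.length).putInt(payload.length).put(payload).array()
  }

  // Simplified stand-in for the merge step: append the new record to the old bytes.
  def merge(existing: Array[Byte], next: Array[Byte]): Array[Byte] = existing ++ next

  // Decode merged bytes.
  //   None            -> key absent (null value)
  //   Some(empty Seq) -> only the marker was stored, i.e. an empty list
  //   Some(values)    -> real elements; any marker records are discarded
  def decode(merged: Array[Byte]): Option[Seq[Array[Byte]]] = {
    if (merged == null) {
      None
    } else {
      val buf = ByteBuffer.wrap(merged)
      val values = ArrayBuffer.empty[Array[Byte]]
      while (buf.remaining() >= 4) {
        val size = buf.getInt()
        if (size > 0) {
          val payload = new Array[Byte](size)
          buf.get(payload)
          values += payload
        }
        // size == 0 is the marker and contributes no element
      }
      Some(values.toSeq)
    }
  }
}
```

With something like this, `put` on an empty array could write the marker instead of calling `clear()`, and the read path could tell "no value" apart from "empty list". The cost is the extra branch on every read and a reserved size value, which is the complexity I would rather defer for now.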
