abhishekd0907 commented on a change in pull request #97:
URL: https://github.com/apache/bahir/pull/97#discussion_r463772633
##########
File path:
sql-streaming-sqs/src/main/scala/org/apache/spark/sql/streaming/sqs/SqsClient.scala
##########
@@ -131,13 +131,24 @@ class SqsClient(sourceOptions: SqsSourceOptions,
}
}
+ private def extractS3Message(parsedBody: JValue): JValue = {
+ implicit val formats = DefaultFormats
+ sourceOptions.messageWrapper match {
+ case sourceOptions.S3MessageWrapper.None => parsedBody
+ case sourceOptions.S3MessageWrapper.SNS => parse((parsedBody \ "Message").extract[String])
Review comment:
In my use case, I've seen some extra messages in the SQS queue in the
following scenarios:
- when the S3 bucket is first configured to publish to the SQS queue during pipeline setup
- when some configuration is changed
- some test messages.
These are not `ObjectCreated` or `ObjectRemoved` messages, so they are not
relevant to the Spark consumer application, and it seemed reasonable to ignore
and delete them.
But if the Spark consumer receives too many unparsable messages, that likely
indicates a problem with the pipeline setup, so in my view it makes sense to
throw an error and abort the pipeline via a finite max-retries mechanism.
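The max-retries idea above could be sketched roughly as follows. This is a minimal illustration only, not the actual `SqsClient` code; the names `maxParseFailures`, `parseMessage`, and `consume` are hypothetical, and the parse step is stubbed out:

```scala
// Minimal sketch of a finite failure budget for unparsable SQS messages.
// `maxParseFailures` and `parseMessage` are hypothetical, for illustration only.
object ParseFailureGuard {
  val maxParseFailures = 3

  // Stub parser: returns Some(parsed) on success, None when the body is unparsable.
  def parseMessage(body: String): Option[String] =
    if (body.startsWith("{")) Some(body) else None

  // Ignores isolated unparsable messages, but aborts once the
  // budget of consecutive failures is exhausted.
  def consume(bodies: Seq[String]): Seq[String] = {
    var failures = 0
    bodies.flatMap { body =>
      parseMessage(body) match {
        case Some(msg) =>
          failures = 0 // reset the budget on any successful parse
          Some(msg)
        case None =>
          failures += 1
          if (failures >= maxParseFailures) {
            throw new IllegalStateException(
              s"Aborting: $failures consecutive unparsable messages")
          }
          None // ignore and drop the unparsable message
      }
    }
  }
}
```

A stray test message is simply dropped, while a sustained stream of garbage (e.g. a misconfigured bucket notification) fails the pipeline fast instead of silently discarding data.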
Refetching and trying to parse the same message again may help if a
legitimate message gets corrupted while being fetched from SQS. Can you list
the scenarios where refetching and re-parsing the same message would help?
Feel free to raise a ticket and open a PR; we can discuss this further there.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]