abhishekd0907 commented on a change in pull request #97:
URL: https://github.com/apache/bahir/pull/97#discussion_r463772633



##########
File path: 
sql-streaming-sqs/src/main/scala/org/apache/spark/sql/streaming/sqs/SqsClient.scala
##########
@@ -131,13 +131,24 @@ class SqsClient(sourceOptions: SqsSourceOptions,
     }
   }
 
+  private def extractS3Message(parsedBody: JValue): JValue = {
+    implicit val formats = DefaultFormats
+    sourceOptions.messageWrapper match {
+      case sourceOptions.S3MessageWrapper.None => parsedBody
+      case sourceOptions.S3MessageWrapper.SNS => parse((parsedBody \ "Message").extract[String])

Review comment:
       In my use case, I've seen some extra messages in the SQS queue in the 
following scenarios:
   - when the S3 bucket is first configured to publish to the SQS queue during 
pipeline setup
   - when some configuration is changed
   - when test messages are sent
   
   These are not `ObjectCreated` or `ObjectRemoved` messages, so they are not 
relevant to the Spark consumer application, and it seemed reasonable to ignore 
and delete them. 
   
   But if the Spark consumer receives too many unparsable messages, it likely 
means there is an issue with the pipeline setup, so in my opinion it makes 
sense to throw an error and abort the pipeline via a finite max-retries 
mechanism.
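   A rough sketch of that finite max-retries idea (the names and the limit 
here are illustrative placeholders, not actual `SqsClient` fields or options):

```scala
// Hypothetical abort-after-N-failures mechanism: count consecutive
// unparsable messages and fail fast once a configured limit is exceeded.
val maxUnparsableRetries = 5 // illustrative config value

var consecutiveFailures = 0

def onMessage(parsed: Option[String]): Unit = parsed match {
  case Some(_) =>
    // A successfully parsed message resets the counter.
    consecutiveFailures = 0
  case None =>
    consecutiveFailures += 1
    if (consecutiveFailures > maxUnparsableRetries) {
      // Too many bad messages in a row suggests a broken pipeline setup,
      // so abort instead of silently deleting messages forever.
      throw new IllegalStateException(
        s"Aborting: $consecutiveFailures consecutive unparsable SQS messages")
    }
}
```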
   
   Refetching and trying to parse the same message again may help if a 
legitimate message gets corrupted while being fetched from SQS. Can you list 
the scenarios where refetching and re-parsing the same message would help?
   
   Feel free to raise a ticket and start a PR. We can discuss this further over 
there.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

