uranusjr commented on a change in pull request #15680:
URL: https://github.com/apache/airflow/pull/15680#discussion_r627046024
##########
File path: airflow/providers/amazon/aws/transfers/mongo_to_s3.py
##########
@@ -117,7 +117,6 @@ def execute(self, context) -> bool:
mongo_collection=self.mongo_collection,
query=cast(dict, self.mongo_query),
mongo_db=self.mongo_db,
- allowDiskUse=self.allow_disk_use,
Review comment:
The `allow_disk_use` argument in `.find()` maps to MongoDB’s
[`cursor.allowDiskUse`](https://docs.mongodb.com/manual/reference/method/cursor.allowDiskUse/),
while `.aggregate()`’s `allowDiskUse` corresponds to [`allowDiskUse` in the
aggregation
pipeline](https://docs.mongodb.com/manual/reference/command/aggregate/#mongodb-dbcommand-dbcmd.aggregate).
I’m honestly not familiar with `cursor.allowDiskUse` (in fact I didn’t know it
existed until today), but from the documentation the two are quite different.
I think whether we should set `find(allow_disk_use=True)` depends on what we
want `MongoToS3Operator.allow_disk_use` to mean. The docstring says
> allow_disk_use: in the case you are retrieving a lot of data, you may have
> to use the disk to save it instead of saving all in the RAM
which suggests it makes sense to set `find(allow_disk_use=True)` from it. But
then the question becomes how we can pass it only when the MongoDB server (not
pymongo!) is 4.4+ (released in July 2020), because the query would fail on
earlier server versions.
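
One possible way to gate the option on the server version is sketched below. This is only an illustration, not a proposal for the final implementation: `parse_server_version` and `find_kwargs` are hypothetical helper names, and the sketch assumes the server version is available as a dotted string (pymongo exposes one via `MongoClient.server_info()["version"]`).

```python
from typing import Any, Dict, Tuple

# MongoDB 4.4 is the first server release documented to accept allowDiskUse on find.
MIN_FIND_ALLOW_DISK_USE: Tuple[int, int] = (4, 4)


def parse_server_version(version: str) -> Tuple[int, int]:
    """Return (major, minor) from a dotted version string such as '4.4.6'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def find_kwargs(server_version: str, allow_disk_use: bool) -> Dict[str, Any]:
    """Build extra keyword arguments for Collection.find() (hypothetical helper).

    The option is only included when the caller asked for it AND the server
    is new enough; otherwise it is left out so older servers never see it.
    """
    if allow_disk_use and parse_server_version(server_version) >= MIN_FIND_ALLOW_DISK_USE:
        return {"allow_disk_use": True}
    return {}
```

The hook could then call something like `collection.find(query, **find_kwargs(version, self.allow_disk_use))`. Silently dropping the flag on older servers is a design choice; logging a warning or raising an error are equally plausible options.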