TheNeuralBit commented on a change in pull request #13286:
URL: https://github.com/apache/beam/pull/13286#discussion_r520979308
##########
File path: sdks/python/apache_beam/dataframe/pandas_docs_test.py
##########
@@ -33,10 +39,15 @@
PANDAS_DIR = os.path.expanduser("~/.apache_beam/cache/pandas-" +
PANDAS_VERSION)
PANDAS_DOCS_SOURCE = os.path.join(PANDAS_DIR, 'doc', 'source')
+parallelism = None
+
def main():
- # Not available for Python 2.
- import urllib.request
+ parser = argparse.ArgumentParser()
+ parser.add_argument('-p', '--parallel', type=int, default=0)
Review comment:
nit: add a help string noting that the default is 0, which means the CPU
count is used.
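
For illustration, something along these lines would do. This is only a sketch: the help wording, `build_parser`, and `resolve_parallelism` are made-up names and text, not part of the PR, and it assumes 0 is meant to fall back to `multiprocessing.cpu_count()`.

```python
import argparse
import multiprocessing


def build_parser():
  # Hypothetical wrapper so the parser can be built and inspected on its own.
  parser = argparse.ArgumentParser()
  parser.add_argument(
      '-p',
      '--parallel',
      type=int,
      default=0,
      help=(
          'Number of processes to use. '
          'Defaults to 0, which uses the CPU count.'))
  return parser


def resolve_parallelism(requested):
  # Assumption: 0 means "use every available CPU"; any other value is
  # taken as the explicit process count.
  return requested or multiprocessing.cpu_count()
```

With this in place, `--help` documents what the 0 default does instead of leaving it implicit.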
##########
File path: sdks/python/apache_beam/dataframe/pandas_docs_test.py
##########
@@ -74,22 +85,56 @@ def main():
if any(filter in path for filter in filters):
paths.append(path)
+ # Using a global here is a bit hacky, but avoids pickling issues when used
+ # with multiprocessing.
+ global parallelism
Review comment:
Is this needed because parallelism is used in run_tests? Instead of
branching on parallelism inside run_tests, could we just make separate
methods for the parallel and single-threaded cases?
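
Roughly what I have in mind, as a sketch only. `run_one`, `run_serial`, `run_parallel`, and `run_all` are hypothetical names, and `run_one` is a stand-in for whatever the real per-file test does. The point is that the worker count is passed as an argument, so the worker function never needs to read a global.

```python
import multiprocessing


def run_one(path):
  # Hypothetical stand-in for testing a single file. It is a module-level
  # function taking only its argument, so multiprocessing can pickle it.
  return path, len(path)


def run_serial(paths):
  # Single-process case: no pool and no pickling involved.
  return [run_one(path) for path in paths]


def run_parallel(paths, processes):
  # Parallel case: the process count is an explicit argument.
  with multiprocessing.Pool(processes) as pool:
    return pool.map(run_one, paths)


def run_all(paths, parallelism):
  # The only branch on parallelism lives here, outside the worker.
  # Assumption: 0 means "use the CPU count".
  if parallelism == 1:
    return run_serial(paths)
  return run_parallel(paths, parallelism or multiprocessing.cpu_count())
```

`main()` would then call `run_all(paths, args.parallel)` and the global could go away, assuming nothing else depends on it.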