Hi all,
I am new to GNU Parallel. After reading "A Million Text Files And A Single
Laptop" and the discussion around it, I want to check whether my understanding
of the difference between the following two commands is right. The first one
puts the redirection inside the quoted command; the second one leaves it
outside:
(1) ls | parallel -m -j $f "cat {} >> ../transactions_cat/transactions.csv"
In this case, the jobs should be:
job 1: cat file1 >> ../transactions_cat/transactions.csv
job 2: cat file2 >> ../transactions_cat/transactions.csv
job 3: cat file3 >> ../transactions_cat/transactions.csv
......
Since the redirection to "../transactions_cat/transactions.csv" is part of each
job, it is outside GNU Parallel's control. So there is a contention issue:
multiple processes write to the same file concurrently, and a lock may be
needed.
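
One way to check how the jobs in case (1) are actually composed is to ask GNU
Parallel to print them instead of running them. The sketch below assumes GNU
Parallel is installed and uses its --dry-run option; the file names a.csv to
d.csv and out.csv are hypothetical and do not need to exist. Note that -m is
documented to insert multiple arguments per command line, so a single printed
job may contain several file names rather than one.

```shell
#!/bin/sh
# Sketch: print the commands GNU Parallel would run, without running them.
# a.csv, b.csv, c.csv, d.csv and out.csv are hypothetical names.
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # The quoted ">>" is part of each job, so it appears in every printed line.
    printf '%s\n' a.csv b.csv c.csv d.csv |
        parallel --dry-run -m -j 2 "cat {} >> out.csv"
else
    echo "GNU Parallel not found; skipping"
fi
```

Because nothing is executed in a dry run, out.csv is not created.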
(2) ls | parallel -m -j $f cat {} >> ../transactions_cat/transactions.csv
In this case, the jobs should be:
job 1: cat file1
job 2: cat file2
job 3: cat file3
......
Since the redirection to "../transactions_cat/transactions.csv" is applied to
parallel itself, the output is under GNU Parallel's control. GNU Parallel can
buffer the output of every job and write the outputs to
"../transactions_cat/transactions.csv" one job at a time, which ensures that
the output of different jobs cannot get mixed up.
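
The difference in who opens the output file can be illustrated without GNU
Parallel at all. In the minimal sketch below, "sh -c" is only a stand-in for
parallel, and the directory redirect_demo and the files in it are hypothetical
names made up for the demonstration. It shows how the shell parses a quoted
versus an unquoted ">>"; it says nothing about GNU Parallel's buffering.

```shell
#!/bin/sh
# Stand-in demo: "sh -c" plays the role of parallel here.
# redirect_demo, file1, file2, inner.csv and outer.csv are hypothetical names.
set -eu
d=redirect_demo
rm -rf "$d"
mkdir "$d"
printf 'a\n' > "$d/file1"
printf 'b\n' > "$d/file2"

# Case (1): ">>" is inside the quoted string, so each child shell
# opens inner.csv itself when it runs its job.
for f in file1 file2; do
    sh -c "cat $d/$f >> $d/inner.csv"
done

# Case (2): ">>" is outside the quotes, so the calling shell opens
# outer.csv once and the child only writes to its own stdout.
sh -c "cat $d/file1 $d/file2" >> "$d/outer.csv"

cat "$d/inner.csv"
cat "$d/outer.csv"
```

Run sequentially like this, both files end up with the same contents; the only
difference is which process opens the output file.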
Is my understanding correct? If not, could someone please correct me?
Thanks in advance!
Best Regards
Nan Xiao (肖楠)
Skype: xiaonan19830818
Jabber/XMPP: [email protected]
Telegram: nanxiao
Personal website (Chinese): http://nanxiao.me/
Personal website (English): http://nanxiao.me/en
Chinese DTrace website: http://chinadtrace.org/