Thank you so much for your response. Here are the results of the tests you
sent:

Verbose: This run produced the same number of files for both reads; I'm not
sure why the other 3-4 times I ran it, it did not. The files appear to be the
same size, with paired last reads.

(base) [hwick@zappalogin interactive_with_verbose]$ cat make_chunks_1_1mill_verbose

DHT_R1 exit code: 0
DHT_R2 exit code: 0
  96 DHT_R1.log
  96 DHT_R2.log
 192 total
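
One way to double-check that the R1 and R2 chunks really line up is to compare line counts chunk by chunk. The following is a minimal sketch, not a definitive check: compare_chunks is a hypothetical helper name, and the small files created in the demo are stand-ins for the real DHT_R1_* and DHT_R2_* output.

```shell
#!/bin/sh
# compare_chunks PREFIX1 PREFIX2   (hypothetical helper, illustration only)
# Prints one line per chunk pair that is missing or differs in line count;
# prints nothing when every chunk has a same-sized partner.
compare_chunks() {
    for f1 in "$1"*; do
        [ -e "$f1" ] || continue          # glob matched nothing
        f2="$2${f1#"$1"}"                 # same suffix, other prefix
        if [ ! -f "$f2" ]; then
            echo "missing: $f2"
            continue
        fi
        n1=$(($(wc -l < "$f1")))
        n2=$(($(wc -l < "$f2")))
        [ "$n1" -eq "$n2" ] || echo "line count differs: $f1 ($n1) vs $f2 ($n2)"
    done
    # Also catch chunks that exist only on the second side.
    for f2 in "$2"*; do
        [ -e "$f2" ] || continue
        f1="$1${f2#"$2"}"
        [ -f "$f1" ] || echo "missing: $f1"
    done
}

# Demo with tiny stand-in inputs (8 lines vs 7 lines, 4 lines per chunk).
demo=$(mktemp -d)
(
    cd "$demo" || exit 1
    printf '%s\n' 1 2 3 4 5 6 7 8 | split -l 4 - DHT_R1_
    printf '%s\n' 1 2 3 4 5 6 7   | split -l 4 - DHT_R2_
    compare_chunks DHT_R1_ DHT_R2_
    # prints: line count differs: DHT_R1_ab (4) vs DHT_R2_ab (3)
)
rm -rf "$demo"
```

On the real data the call would be compare_chunks DHT_R1_ DHT_R2_ from the directory holding the chunks; counting lines in every chunk will take a while at this file size.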
Version:

(base) [hwick@zappalogin test_2019]$ split --version
split (GNU coreutils) 8.4
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

Written by Torbjörn Granlund and Richard M. Stallman.

STDERR:
The only thing in the stderr file is an odd duck of:

-sh: module: line 1: syntax error: unexpected end of file
-sh: error importing function definition for `BASH_FUNC_module'
Python 3.6.8 :: Anaconda, Inc.
/bin/sh: module: line 1: syntax error: unexpected end of file
/bin/sh: error importing function definition for `BASH_FUNC_module'

This prints for every job I run with this particular flavor of conda/bash,
though, and doesn't seem to affect anything else (as far as I know).

All jobs finished well below allotted memory and with exit status 0, even
when split didn't make the right number of output files.

Do you know any reason why the behavior would be inconsistent?

Pairing check: unfortunately my server's shell doesn't support the <(...)
syntax used with paste here. I've run into this issue before, but I forget
what the workaround is. I can't run this command interactively because my
server session times out (these files are > 3 billion lines each, so it takes
a long time to zcat them). Submitted as a job, it fails with:

/cm/local/apps/sge/var/spool/zappa-06/job_scripts/358558: line 10: syntax error near unexpected token `('
/cm/local/apps/sge/var/spool/zappa-06/job_scripts/358558: line 10: `paste <(zcat MH1_R2.fastq) <(zcat MH1_R2.fastq.gz) \'
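
The <(...) syntax is process substitution, which is a bash feature; the "-sh:" and "/bin/sh:" prefixes in the stderr output above suggest the job script may be running under sh instead. A possible workaround that avoids the syntax is to use named pipes. This is a minimal sketch under assumptions: check_pairs is a hypothetical helper name, the demo files are tiny stand-ins, the two inputs are assumed to be the R1 and R2 files, and read headers are assumed to consist of an ID followed by a space and a comment field, as the original awk test on $1 and $3 implies.

```shell
#!/bin/sh
# check_pairs R1.fastq.gz R2.fastq.gz   (hypothetical helper, illustration only)
# Prints one line per FASTQ record whose read IDs differ between the two
# files; prints nothing when all IDs match. Uses named pipes (FIFOs) in
# place of bash process substitution.
check_pairs() {
    dir=$(mktemp -d) || return 1
    mkfifo "$dir/r1" "$dir/r2"
    zcat "$1" > "$dir/r1" &
    zcat "$2" > "$dir/r2" &
    # Header lines are every 4th line starting at line 1; after paste,
    # field 1 is the R1 read ID and field 3 is the R2 read ID.
    paste "$dir/r1" "$dir/r2" \
        | awk 'NR%4!=1 { next } $1!=$3 { print "Error in line " NR ": " $1 " vs " $3 }'
    wait
    rm -rf "$dir"
}

# Demo with two-record stand-in files; the second record is deliberately mismatched.
demo=$(mktemp -d)
printf '@read1 1:N:0\nACGT\n+\nIIII\n@read2 1:N:0\nTTTT\n+\nIIII\n' | gzip > "$demo/R1.fastq.gz"
printf '@read1 2:N:0\nACGT\n+\nIIII\n@readX 2:N:0\nTTTT\n+\nIIII\n' | gzip > "$demo/R2.fastq.gz"
check_pairs "$demo/R1.fastq.gz" "$demo/R2.fastq.gz"
# prints: Error in line 5: @read2 vs @readX
rm -rf "$demo"
```

If bash is available on the compute nodes, asking the scheduler to run the script with it (for example with qsub's -S /bin/bash option on SGE) may be a simpler fix; whether that works depends on how this cluster is set up.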

On Fri, Jun 7, 2019 at 11:39 PM Assaf Gordon <assafgor...@gmail.com> wrote:

> Hello,
>
> On Fri, Jun 07, 2019 at 09:48:44PM -0400, Heather Wick wrote:
> > Yes, sorry, I should have specified that I already checked that the
> > original fastq files are indeed paired and sorted with the same number of
> > lines and same starting/ending IDs, narrowing down the issue to a problem
> > with split.
>
> It could be a problem with "split", but we'll need to dig a bit deeper
> to be able to pinpoint the exact issue.
>
> Could you please try the following commands and post the results?
>
>     zcat MH1_R1.fastq.gz \
>        | split --verbose -l 40000000 - DHT_R1_ > DHT_R1.log ; echo DHT_R1 exit code: $?
>     zcat MH1_R2.fastq.gz \
>        | split --verbose -l 40000000 - DHT_R2_ > DHT_R2.log ; echo DHT_R2 exit code: $?
>     wc -l DHT_R1.log DHT_R2.log
>
> Two more questions:
> 1. can you post the result of "split --version" ?
> 2. You mentioned "jobs" - if you are running these as submitted jobs on
> a cluster (e.g. with "qsub"), can you double-check the STDERR log files
> to ensure no errors were encountered ?
>
> If we still can't pinpoint the issue, the next steps would be to check
> the DHT_R{1,2}.log files, and then try to compare the content of the
> split files.
>
> I assume the input files are indeed correctly paired, but just to check,
> if you could try the following command, it should not print anything
> to the screen (indicating all sequence IDs are paired):
>
>     paste <(zcat MH1_R2.fastq) <(zcat MH1_R2.fastq.gz) \
>        | awk 'NR%4!=1 { next } $1!=$3 { print "Error in line " NR ":" $1 " vs " $3 }'
>
> regards,
>  - assaf

-- 
Heather Wick
PhD Candidate, Human Genetics
Labs of Sarah Wheelan and Vasan Yegnasubramanian
Institute of Genetic Medicine
Johns Hopkins University School of Medicine
hwi...@jhmi.edu
