Re: [Genome] Extract orthologous regions from MAF with mafFrags

Zhenguo Zhang Fri, 04 Feb 2011 15:04:50 -0800

Dear Brian,

Thank you for this information. I will try mafsInRegion (which seems very
slow). I have a few questions related to the net alignment.

1. During net construction with chainNet, some (long) chains are trimmed to
match the boundaries of a parent gap when they fill gaps in the parent
chain, so for these chains only the trimmed portion is recorded in the net
file. Am I right?

2. If the above is true: I know that the axtNet files in
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/vsRheMac2/axtNet are generated
by filtering all the chains (file hg19.rheMac2.all.chain.gz) using netToAxt.
Do the alignments in the axtNet files contain only the trimmed portion of
those trimmed chains, or the whole chain?

3. netChainSubset is used to generate liftOver files in directory
ftp://hgdownload.cse.ucsc.edu/goldenPath/hg19/liftOver. This process also
uses the net file to filter the chains, so what happens to those trimmed
chains? Is the option -wholeChains used in your procedure?

4. Based on my understanding, the axtNet files and liftOver files are two
different formats (axt and chain, respectively) of the same chains, both
filtered by net files. Am I correct? If not, what is the difference?

5. The Multiz46way alignment uses the pairwise alignments produced above. Do
they include only the trimmed portions of trimmed chains? The conservation
track description page says that an additional filtering step was
introduced to reduce the number of paralogs and pseudogenes. I think that
in the net files, for a given target region (say hg18), there is only one
best-aligned region (if available) from the query species (say rheMac2).
What does 'reduce the number of paralogs' mean? Does it operate on the query
species?
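
To make "trimmed" in questions 1 to 3 concrete, here is a minimal sketch of what I mean by clipping a chain to a parent gap. It is only a toy model, not chainNet's actual code: a chain is assumed to be a list of ungapped blocks (t_start, q_start, size) on the + strand of both genomes, and trim_chain is a hypothetical helper name.

```python
def trim_chain(blocks, lo, hi):
    """Clip ungapped blocks (t_start, q_start, size) to the target
    interval [lo, hi), shifting the query start by the same offset.
    Assumes + strand on both target and query (toy model)."""
    trimmed = []
    for t_start, q_start, size in blocks:
        new_start = max(t_start, lo)
        new_end = min(t_start + size, hi)
        if new_start < new_end:  # block overlaps the gap interval
            offset = new_start - t_start
            trimmed.append((new_start, q_start + offset, new_end - new_start))
    return trimmed


# A chain with three ungapped blocks, clipped to a parent gap at 120..230.
chain = [(100, 5000, 50), (160, 5050, 40), (220, 5100, 30)]
trimmed = trim_chain(chain, 120, 230)
print(trimmed)
```

The question is whether downstream files (axtNet, liftOver chains, multiz input) carry the equivalent of `trimmed` or of the full `chain`.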

Sorry I have so many questions, but it is important for me to manipulate the
data correctly and efficiently. Your help is greatly appreciated and I am
happy to cite UCSC. Thank you!

Zhenguo

2011/2/4 Brian Raney <[email protected]>

> Hey Zhenguo,
>
> I think for your purposes, the mafsInRegion program will be better to
> use than mafFrags.  mafsInRegion does not coalesce blocks, so it will
> maintain the query addresses, and it won't put in the '.'s.
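
Since mafsInRegion keeps the query coordinates, genomic breaks could be flagged by checking that each species continues exactly where its previous block ended. A minimal sketch, assuming standard MAF 's' lines (src, start, size, strand, srcSize, text) and blocks listed in reference order; parse_maf_blocks and find_breaks are hypothetical helper names, not UCSC tools, and the sample coordinates are made up.

```python
def parse_maf_blocks(lines):
    """Group MAF 's' lines into blocks of (src, start, size, strand)."""
    blocks, current = [], None
    for line in lines:
        line = line.strip()
        if line == "a" or line.startswith("a "):
            current = []            # an 'a' line opens a new alignment block
            blocks.append(current)
        elif line.startswith("s ") and current is not None:
            _, src, start, size, strand, _src_size, _text = line.split()
            current.append((src, int(start), int(size), strand))
    return blocks


def find_breaks(blocks):
    """Report blocks where a species changes sequence or strand, or does
    not start where its previous block ended."""
    last = {}    # species -> (src, strand, expected next start)
    breaks = []
    for i, block in enumerate(blocks):
        for src, start, size, strand in block:
            species = src.split(".", 1)[0]
            prev = last.get(species)
            if prev is not None and prev != (src, strand, start):
                breaks.append((i, species, prev, (src, strand, start)))
            last[species] = (src, strand, start + size)
    return breaks


maf_text = """\
a score=1
s hg18.chr1 100 4 + 1000 ACGT
s rheMac2.chr1 500 4 + 2000 ACGT

a score=2
s hg18.chr1 104 4 + 1000 ACGT
s rheMac2.chr1 504 4 + 2000 ACGT

a score=3
s hg18.chr1 108 4 + 1000 ACGT
s rheMac2.chr7 90 4 - 3000 ACGT
"""
breaks = find_breaks(parse_maf_blocks(maf_text.splitlines()))
print(breaks)
```

This strict check also flags small query-side insertions between blocks; a tolerance on the allowed jump could be added if those should not count as breaks.
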
>
> Our standard alignment procedure calls the chainNet program (used to
> create the nets from the chains) with a parameter that sets the
> minimum gap to fill to be one base.   Whether N's are considered to be
> a gap or not depends on how many of them there are, and whether the
> chaining process included them in the chain.  In general, if there are
> only a few N's, the chains will include them, so they won't be
> considered a gap in the net.
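
To illustrate the effect of the minimum-gap setting, here is a toy sketch (not chainNet's actual code) that lists the target-side gaps between the ungapped blocks of one chain and keeps those at least min_space bases long. It assumes chain-format block lines of the form "size dt dq", with a final line holding only "size". I believe the corresponding chainNet option is -minSpace, but that name and the >= comparison are assumptions to check against the chainNet usage message; target_gaps and fillable_gaps are hypothetical helper names.

```python
def target_gaps(t_start, block_lines):
    """Yield (gap_start, gap_end) target intervals between the ungapped
    blocks of a chain, given its 'size dt dq' block lines."""
    pos = t_start
    for line in block_lines:
        fields = line.split()
        pos += int(fields[0])        # skip over the ungapped block
        if len(fields) == 3:         # the last block line has only 'size'
            dt = int(fields[1])
            if dt > 0:
                yield (pos, pos + dt)
            pos += dt


def fillable_gaps(t_start, block_lines, min_space=1):
    """Keep gaps of at least min_space target bases (assumed rule)."""
    return [(s, e) for s, e in target_gaps(t_start, block_lines)
            if e - s >= min_space]


blocks = ["100 1 0", "50 0 7", "30 40 12", "20"]
gaps_min1 = fillable_gaps(1000, blocks, min_space=1)
gaps_min25 = fillable_gaps(1000, blocks, min_space=25)
print(gaps_min1)
print(gaps_min25)
```

With a threshold of one base, the 1-base gap at 1100 is a candidate for filling; with a threshold of 25 only the 40-base gap remains.
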
>
> I hope this answers your questions.   If you have follow-up questions,
> please address them to this list.
>
> Brian
>
> On Thu, Feb 3, 2011 at 9:51 AM, Zhenguo Zhang <[email protected]> wrote:
> > Dear Colleagues,
> >
> > I am trying to get the orthologous/homologous regions from hg18
> > multiz28way maf files for a list of human coordinates. I have read the
> > documents and track descriptions, but still have several questions:
> >
> > 1. The results from mafFrags lost the genomic coordinate information for
> > all the query species, and one maf block for each coordinate region in
> > input bed file. It may contain genomic breaks in some species, that is,
> > the corresponding sequence for a query species (say, rheMac2) in the maf
> > block is composed of different genomic locations (from the same or
> > different chromosomes). I derive this based on the net alignment
> > construction procedure. I need to know this information and exclude
> > these cases because they are artificial and not true sequences. How can
> > I get the genomic breaks in the results from mafFrags? When I checked
> > the alignment in the UCSC browser, they are given in different blocks,
> > which seems based on genomic breaks in any species.
> >
> > 2. In the mafFrags results, there are two symbols indicating gaps. One is
> > dash '-', and the other is dot '.'. I think dash is the real gaps in
> > query sequences, but what does the dot represent? Does it represent
> > unsequenced regions or gaps too?
> >
> > 3. In the net alignment construction, the gaps in the top level chain are
> > filled by trimmed lower-scoring chains. What is the minimum length for
> > gaps to be filled? Is 1-base long gap in top level chain also filled by
> > lower-scoring chain?
> >
> > 4. During net alignment process, are the unsequenced regions (N's)
> > regarded as gaps and filled during this process?
> >
> > Thank you in advance!
> >
> > Zhenguo
> > --
> > ——————————————————————
> > Zhenguo Zhang
> > Postdoctoral Scholar
> > Institute of Molecular Evolutionary Genetics
> > Penn State University
> > 312 Mueller Lab, University Park, PA 16802
> > Tel: 814-865-2796
> > Homepage: http://www.personal.psu.edu/zuz17/
> > Lab:  http://homes.bio.psu.edu/people/Faculty/Nei/
> > _______________________________________________
> > Genome maillist  -  [email protected]
> > https://lists.soe.ucsc.edu/mailman/listinfo/genome
> >
>

-- 
——————————————————————
Zhenguo Zhang
Postdoctoral Scholar
Institute of Molecular Evolutionary Genetics
Penn State University
312 Mueller Lab, University Park, PA 16802
Tel: 814-865-2796
Homepage: http://www.personal.psu.edu/zuz17/
Lab:  http://homes.bio.psu.edu/people/Faculty/Nei/
