We have been experimenting with other formats in our package ORFik. The fst format is also a good candidate, though the problem is that only R and Julia support it currently. My biggest problems with bigWigs are the slow full-file access time and the lack of support for multiple score columns (as far as I know).
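To make the "multiple score columns" point concrete, here is a minimal sketch in Python of a column-oriented coverage table: one row per run of constant coverage, with any number of named score columns sharing the same intervals. Everything in it (the `CoverageTable` class, the `forward`/`reverse` column names, the toy values) is a hypothetical illustration, not how ORFik, fst, or bigWig actually lay out data.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class CoverageTable:
    """Hypothetical column-oriented table: one row per run, many score columns."""
    starts: list[int]                # 0-based run starts, sorted, non-overlapping
    ends: list[int]                  # exclusive run ends, same order as starts
    scores: dict[str, list[float]]   # column name -> one score per run

    def query(self, start: int, end: int, column: str) -> list[tuple[int, int, float]]:
        """Return runs overlapping [start, end), clipped to that window."""
        # First run whose end lies strictly after the window start.
        i = bisect_right(self.ends, start)
        hits = []
        while i < len(self.starts) and self.starts[i] < end:
            hits.append((max(self.starts[i], start),
                         min(self.ends[i], end),
                         self.scores[column][i]))
            i += 1
        return hits


# Toy data: three runs, two score columns stored side by side.
table = CoverageTable(
    starts=[0, 10, 20],
    ends=[10, 20, 30],
    scores={"forward": [1.0, 2.0, 0.0], "reverse": [0.0, 5.0, 3.0]},
)
reverse_hits = table.query(5, 25, "reverse")
```

The intervals are stored once and each extra score is just another column, which is the property I miss in bigWig, where (as far as I know) each score needs its own file.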
________________________________
From: Bioc-devel <bioc-devel-boun...@r-project.org> on behalf of Vincent Carey <st...@channing.harvard.edu>
Sent: Friday, May 24, 2024 12:26:53 AM
To: Chris Wilks (gmail) <broadsw...@gmail.com>
Cc: Price, Amanda (NIH/NICHD) [E] <amanda.pr...@nih.gov>; Bioc-devel <bioc-devel@r-project.org>; Nina Rajpurohit <nina.rajpuro...@libd.org>; Jaffe, Andrew E. <andreweja...@gmail.com>
Subject: Re: [Bioc-devel] Remote BigWig file access

thanks

On Thu, May 23, 2024 at 5:36 PM Chris Wilks (gmail) <broadsw...@gmail.com> wrote:

> Thanks Vince, understood about the Core's focus right now.
>
> I think this is something that Leo and I can fix among ourselves for the
> time being.
>
> Looking forward, as you brought up, if we were to refresh recount or
> produce a recount4 (discussed) we'd certainly consider additional coverage
> formats.
>
> I'm aware of tiledb though not duckdb (I'll have to check it out), thanks
> for the pointer.
>
> There's also the D4 format from Aaron Quinlan's lab from a few years ago
> which was explicitly designed to replace bigwigs:
> https://www.nature.com/articles/s43588-021-00085-0
>
> All that said, we're pretty committed to bigwigs at this point given the
> ~750,000 sequence runs we've encoded using them for recount3.
>
> On Wed, May 22, 2024 at 7:17 AM Vincent Carey <st...@channing.harvard.edu>
> wrote:
>
>> Really glad to see this discussion moving forward.
>> I would say that the core is wrangling with some even lower-level
>> technical concerns right now, so I can't jump in just now. I just want
>> to raise the question of whether bigWig files are a technologically
>> sound format to continue investing in for the use case of targeted
>> remote query resolution on genomic coordinates. A number of new
>> concepts have come into play since bigWig was designed and
>> implemented. I'll naively mention duckdb and tiledb, which seem to
>> have very good remote performance. Maybe these are too generic ...
>> are there other concepts in GA4GH that might be relevant to leverage
>> for recount-like projects in the future?
>>
>> On Wed, May 22, 2024 at 6:58 AM Chris Wilks (gmail) <broadsw...@gmail.com>
>> wrote:
>>
>>> Thanks for sharing Leo, this does interest me, especially since so
>>> much is built on BigWig access via rtracklayer at least in the
>>> recount2 ecosystem.
>>>
>>> As you alluded to, Megadepth currently supports remote access of
>>> BigWigs (and BAMs) over HTTPS on all platforms (Linux, MacOS, and
>>> Windows), getting back just the byte ranges overlapping the set of
>>> regions requested so it should work for at least recount2/recount3
>>> and anything that uses HTTP/s.
>>>
>>> I'd be open to exploring updates to the Megadepth C/C++ code side to
>>> support Rle if that makes sense to replace rtracklayer.
>>> But to do that you'd need to be involved in updating all the R
>>> packages if you're willing (both megadepth and those that currently
>>> rely on rtracklayer for this functionality).
>>>
>>> Let me know if you want to chat about this over Zoom,
>>> Chris
>>>
>>> On Tue, May 21, 2024 at 2:41 PM Leonardo Collado Torres
>>> <lcollado...@gmail.com> wrote:
>>>
>>> > Hi Bioc-devel,
>>> >
>>> > As some of you are aware, rtracklayer::import() has long provided
>>> > access to import BigWig files.
>>> > Those files can be shared on servers
>>> > and accessed remotely thanks to all the effort from many of you in
>>> > building and maintaining rtracklayer.
>>> >
>>> > From my side, derfinder::loadCoverage() relies on
>>> > rtracklayer::import.bw(), and recount::expressed_regions() +
>>> > recount::coverage_matrix() use derfinder::loadCoverage().
>>> > recountWorkflow showcases those recount functions on larger datasets.
>>> > brainflowprobes by Amanda Price, Nina Rajpurohit and others also ends
>>> > up relying on rtracklayer::import.bw() through these functions.
>>> >
>>> > At https://github.com/lawremi/rtracklayer/issues/83 I initially
>>> > reported some issues once our recount2/3 data host changed, but
>>> > previously Brian Schilder also reported that one could no longer read
>>> > remote files https://github.com/lawremi/rtracklayer/issues/73.
>>> > https://github.com/lawremi/rtracklayer/issues/63 and/or
>>> > https://github.com/lawremi/rtracklayer/issues/65 might have been
>>> > related.
>>> >
>>> > Yesterday I updated
>>> > https://github.com/lawremi/rtracklayer/issues/83#issuecomment-2121313270
>>> > with a comment showing some small reproducible code, and that the
>>> > workaround of downloading the data first, then using
>>> > rtracklayer::import() on the local data does work. However, this
>>> > workaround does involve a lot of, hmm, wasteful data transfer.
>>> >
>>> > On the recount vignette at some point I access just chrY of a bigWig
>>> > file that is about 1300 MB. On the recountWorkflow vignette I do
>>> > something similar for a 7GB bigWig file.
>>> > Previously accessing just
>>> > chrY on these files was a small data transfer.
>>> >
>>> > On recountWorkflow version 1.29.2
>>> > https://github.com/LieberInstitute/recountWorkflow, I've included
>>> > pre-computed results (~2 MB) to avoid downloading tons of data, though
>>> > the vignette code shows how to actually fully reproduce the results if
>>> > you don't mind downloading those large files. I also implemented some
>>> > workarounds on recount, though I haven't yet gone the full route of
>>> > including pre-computed results. I have yet to try implementing a
>>> > workaround for brainflowprobes.
>>> >
>>> > My understanding is that rtracklayer's root issues are elsewhere and
>>> > changes in dependencies rtracklayer has likely created these problems.
>>> > These problems are not always in the control of rtracklayer authors to
>>> > resolve, and also create an unexpected burden on them.
>>> >
>>> > If one considers alternatives to rtracklayer, I see that there's a new
>>> > package https://github.com/PoisonAlien/trackplot/tree/master that uses
>>> > bwtool (a system dependency), and older alternative
>>> > https://github.com/andrelmartins/bigWig that hasn't had updates in 4
>>> > years, and a CRAN package
>>> > (https://cran.r-project.org/web/packages/wig/readme/README.html) that
>>> > recommends using rtracklayer for larger files.
>>> > I guess that I could
>>> > also try using megadepth https://research.libd.org/megadepth/, though
>>> > derfinder::loadCoverage uses rtracklayer::import(as = "RleList") for
>>> > efficiency
>>> > https://github.com/lcolladotor/derfinder/blob/f9cd986e0c1b9ea6551d0d8d2077d4501216a661/R/loadCoverage.R#L401
>>> > and lots of functions in that package were built for that structure
>>> > (RleList objects). I likely missed other alternatives.
>>> >
>>> > My current line of thought is to keep implementing workarounds using
>>> > local data (sometimes with pre-computed results) for recount,
>>> > recountWorkflow, and brainflowprobes (derfinder only has tests with
>>> > local bigWig files) without really altering the internals of those
>>> > packages. That is, assume that the remote BigWig file access via
>>> > rtracklayer will indefinitely be suspended, though it could be
>>> > supported again at some point and when it does, those packages will
>>> > work again with remote BigWig files as if nothing ever happened. But I
>>> > wanted to check in if this is what others who use BigWig files are
>>> > thinking of doing.
>>> >
>>> > Thanks!
>>> >
>>> > Best,
>>> > Leo
>>> >
>>> > Leonardo Collado Torres, Ph. D.
>>> > Investigator, LIEBER INSTITUTE for BRAIN DEVELOPMENT
>>> > Assistant Professor, Department of Biostatistics
>>> > Johns Hopkins Bloomberg School of Public Health
>>> > 855 N. Wolfe St., Room 382
>>> > Baltimore, MD 21205
>>> > lcolladotor.github.io
>>> > lcollado...@gmail.com
>>> >
>>>
>>> [[alternative HTML version deleted]]
>>>
>>> _______________________________________________
>>> Bioc-devel@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>
>> The information in this email is intended only for the person to whom it
>> is addressed. If you believe this e-mail was sent to you in error and
>> the email contains patient information, please contact the Partners
>> Compliance HelpLine at http://www.partners.org/complianceline .
>> If the email was sent to you in error but does not contain patient
>> information, please contact the sender and properly dispose of the email.

_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel