Also, I plan to do a Windows release once I set up the CI for Windows and get the major unit tests to pass. It would still contain winutils, though. However, we can do another release after deprecating winutils.
Thanks, --Gautham On Mon, 14 Nov 2022 at 23:34, Gautham Banasandra <gaur...@apache.org> wrote: > Hi Iñigo, > > I would like to aim for winutils deprecation by the end of the first > quarter of 2023. > It really depends on how fast I can wrap up with setting up CI for > Windows. Given > that this involves getting Yetus to work properly on Windows, I feel it's > a bit > ambitious. But if things fall into place, I think end of the first quarter > of 2023 would > be a reachable timeline. > > Thanks, > --Gautham > > On Sat, 12 Nov 2022 at 00:20, Iñigo Goiri <elgo...@gmail.com> wrote: > >> Gautham, thank you very much for the summary. >> Do you have a time-line for when we can get rid of winutils? >> My idea was to get this and the YARN federation hardening work into a 3.4 >> release. >> >> >> >> On Fri, Nov 11, 2022, 10:15 Gautham Banasandra <gaur...@apache.org> >> wrote: >> >>> Hi folks, >>> >>> >>> What have we done so far? >>> ------------------------------------ >>> Inigo and I have been working for quite some time now on this topic, >>> but our efforts have mostly been oriented towards making Hadoop >>> cross-platform compatible. Our focus has been on streamlining the >>> process of building Hadoop on Windows so that one can easily >>> build and run Hadoop, just like on Linux. We reached this milestone >>> quite recently and I've documented the steps for doing so here - >>> >>> https://github.com/apache/hadoop/blob/5bb11cecea136acccac2563b37021b554e517012/BUILDING.txt#L493-L622 >>> >>> >>> >>> Is winutils still required? >>> ------------------------------- >>> As Steve mentioned, we would still require winutils for running >>> Hadoop on Windows. The major change here is that winutils >>> need not come from a third-party repository anymore, rather it >>> gets built along with the Hadoop codebase itself henceforth. >>> However, I agree that we need to deprecate winutils and >>> replace it with something better so that Hadoop users can have >>> a smoother experience. 
>>> >>> >>> What's the best way to deprecate winutils? >>> -------------------------------------------------------- >>> Over all the time that I've spent making Hadoop cross-platform >>> compatible, I've realized that the best way would be to have a >>> JNI interface that wraps around a native layer. This native layer >>> could be implemented mostly in C++. C++17 provides the >>> std::filesystem namespace that can satisfy most of the native >>> filesystem API requirements. Since std::filesystem is part of "The >>> Standard Library", these APIs will be present on most/all the C++ >>> compilers of the various OS platforms. For those parts that can't >>> be satisfied by std::filesystem, we'll have to fill the gap >>> by writing C code that makes system calls. Please note that >>> these C files will need to be implemented specifically for each >>> platform. I took this approach when I wrote the x-platform library to >>> make the HDFS native client cross-platform compatible - >>> >>> https://github.com/apache/hadoop/tree/trunk/hadoop-hdfs-project/hadoop-hdfs-native-client/src/main/native/libhdfspp/lib/x-platform >>> >>> >>> What am I focussing on currently? >>> ------------------------------------------------ >>> So far, I've focussed on getting the build to work seamlessly >>> on Windows. I'm now trying to protect this from breaking by >>> setting up CI on Jenkins that builds Hadoop on Windows >>> for the precommit validation - >>> https://issues.apache.org/jira/browse/INFRA-23809 >>> Yes, it does involve getting >>> Yetus to run on Windows. I can work on deprecating winutils >>> after this. >>> >>> Thanks, >>> --Gautham >>> >>> On Fri, 11 Nov 2022 at 19:51, Steve Loughran <ste...@cloudera.com.invalid> >>> wrote: >>> >>>> It's time to reach for the axe. >>>> >>>> We haven't shipped a version of Apache Hadoop which builds and runs >>>> on >>>> Windows for a long long time. 
The only people trying to use the >>>> library >>>> on Windows will have been people trying to use Spark on their laptops >>>> with "small" datasets of only a few tens of gigabytes at a time, the >>>> kind of work where 32GB of RAM and 16 cores is enough. Put differently: >>>> the >>>> storage and performance of a single laptop mean that it is perfectly >>>> suitable for doing reasonable amounts of work, and the main barrier to >>>> doing >>>> so is getting a copy of the winutils lib. >>>> >>>> I know Gautham and Inigo are trying to get Windows to work as a location >>>> for >>>> YARN again; not sure about HDFS. And there, yes, we have to say "they are >>>> likely to need an extra binary". >>>> >>>> But for someone wanting to count the number of rows in an Avro file? Do >>>> a >>>> simple bit of filtering on some Parquet data? These are the kind of >>>> things that anyone with a Linux/Mac laptop can do with ease, and it is >>>> not >>>> fair to put suffering on to others. And while we could just say "why don't >>>> you >>>> just install Linux on that laptop then?", as someone who has had a >>>> Linux laptop for many years I know there are strong arguments against >>>> it >>>> even beyond the "my employer demands Windows with their IT software", as >>>> "a >>>> laptop which comes out of sleep reliably" is kind of important too. >>>> >>>> How can we let the people who have to live in this world (and we have >>>> someone who is clearly willing to help) live a better life? Funnily >>>> enough, >>>> the fact that we have not shipped a working version of winutils >>>> for a >>>> long time actually gives us an advantage: we can pick incompatible >>>> changes >>>> and be confident that most people aren't going to notice. >>>> >>>> I think a good first step would be for Shell to work well if winutils >>>> isn't >>>> around: get rid of that static WINUTILS string and path/file >>>> equivalents, >>>> the ones deprecated in 2015. 
We can rip them out knowing no external >>>> code >>>> is using them. >>>> >>>> Then we should look very closely at FileUtil to see how much of that is >>>> needed and how we can isolate it better. If you look at the change log >>>> of >>>> that file, we do have to consider that every time it execs a shell >>>> command >>>> there's a security risk, and more than once we've had to fix it. Not >>>> executing any external shell commands is good everywhere. >>>> >>>> >>>> >>>> >>>> On Thu, 10 Nov 2022 at 19:00, Chris Nauroth <cnaur...@apache.org> >>>> wrote: >>>> >>>> > Symlink support on the local file system is still used. One example I >>>> can >>>> > think of is YARN container launch [1]. >>>> > >>>> > I would welcome removal of winutils, as already described in various >>>> JIRA >>>> > issues. I think the biggest challenge we'll have is testing of a >>>> transition >>>> > from winutils to the newer Java APIs. The contract tests help, but >>>> > historically there was also a tendency to break things in downstream >>>> > dependent projects. >>>> > >>>> > I'd suggest taking this on piecemeal, transitioning small pieces of >>>> > FileSystem off of winutils one at a time. >>>> > >>>> > [1] >>>> > >>>> > >>>> https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/java/org/apache/hadoop/yarn/server/nodemanager/containermanager/launcher/ContainerLaunch.java#L1508-L1509 >>>> > >>>> > Chris Nauroth >>>> > >>>> > >>>> > On Thu, Nov 10, 2022 at 10:33 AM Wei-Chiu Chuang <weic...@apache.org> >>>> > wrote: >>>> > >>>> > > > >>>> > > > >>>> > > > >>>> > > > * Bare Naked Local File System v0.1.0 doesn't (yet) support >>>> symlinks >>>> > > > or the sticky bit. >>>> > > > >>>> > > ok to not support symlinks. The symlinks of HDFS are not being >>>> maintained >>>> > > and I am not aware of anything relying on it. >>>> > > So I assume people don't need it. 
>>>> > > >>>> > > Sticky bit would be useful, I guess. >>>> > > >>>> > > I suppose folks working at Microsoft would be more interested in >>>> this >>>> > work? >>>> > > Last time I heard, Gautham and Inigo were revamping Hadoop's Windows >>>> > > support. >>>> > > >>>> > > >>>> > > > * But the bigger issue is how to excise Winutils completely in >>>> the >>>> > > > existing Hadoop code. Winutils assumptions are hard-coded at >>>> a low >>>> > > > level across various classes—even code that has nothing to do >>>> with >>>> > > > the file system. The startup configuration for example calls >>>> > > > `StringUtils.equalsIgnoreCase("true", valueString)` which >>>> loads the >>>> > > > `StringUtils` class, which has a static reference to `Shell`, >>>> which >>>> > > > has a static block that checks for `WINUTILS_EXE`. >>>> > > > * For the most part there should no longer even be a need for >>>> > anything >>>> > > > but direct Java API access for the local file system. But >>>> muddling >>>> > > > things further, the existing `RawLocalFileSystem` >>>> implementation >>>> > has >>>> > > > /four/ ways to access the local file system: Winutils, JNI >>>> calls, >>>> > > > shell access, and a "new" approach using "stat". The "stat" >>>> > approach >>>> > > > has been switched off with a hard-coded >>>> `useDeprecatedFileStatus = >>>> > > > true` because of HADOOP-9652 >>>> > > > <https://issues.apache.org/jira/browse/HADOOP-9652>. >>>> > > > * Local file access is not contained within >>>> `RawLocalFileSystem` but >>>> > > > is scattered across other classes; `FileUtil.readLink()` for >>>> > example >>>> > > > (which `RawLocalFileSystem` calls because of the deprecation >>>> issue >>>> > > > above) uses the shell approach without any option to change >>>> it. >>>> > > > (This implementation-specific decision should have been >>>> contained >>>> > > > within the `FileSystem` implementation itself.) 
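To illustrate the "direct Java API access" mentioned in the list above, here is a minimal sketch of reading a symlink target and POSIX permissions through `java.nio.file`, without spawning a shell or calling winutils. The class `LocalFsProbe` and its method names are hypothetical and not part of Hadoop; this is not how `RawLocalFileSystem` or `FileUtil.readLink()` are implemented, only an indication of what a pure-Java replacement might look like, assuming the standard NIO semantics are acceptable.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.LinkOption;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFileAttributeView;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Optional;
import java.util.Set;

/** Hypothetical helper (not a Hadoop class): local file metadata via java.nio only. */
final class LocalFsProbe {
  private LocalFsProbe() {}

  /** Returns the link target, or empty if the path is not a symbolic link. */
  static Optional<Path> readLink(Path path) throws IOException {
    if (!Files.isSymbolicLink(path)) {
      return Optional.empty();
    }
    return Optional.of(Files.readSymbolicLink(path));
  }

  /**
   * Returns permissions in "rwxr-x---" form, or empty when the file system
   * has no POSIX attribute view (for example, a default NTFS volume).
   */
  static Optional<String> posixPermissions(Path path) throws IOException {
    PosixFileAttributeView view = Files.getFileAttributeView(
        path, PosixFileAttributeView.class, LinkOption.NOFOLLOW_LINKS);
    if (view == null) {
      return Optional.empty();
    }
    Set<PosixFilePermission> perms = view.readAttributes().permissions();
    return Optional.of(PosixFilePermissions.toString(perms));
  }

  public static void main(String[] args) throws IOException {
    // Probe the path given on the command line, or the current directory.
    Path path = Paths.get(args.length > 0 ? args[0] : ".");
    System.out.println("link target: " + readLink(path));
    System.out.println("permissions: " + posixPermissions(path));
  }
}
```

Returning an empty `Optional` where the platform lacks a POSIX view is one possible design choice; the sticky bit is not covered here because `PosixFilePermission` does not model it.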
>>>> > > > >>>> > > > In short, it's a mess that has accumulated over years and is getting >>>> > worse, >>>> > > > charging high interest on what at first was a small, >>>> self-contained >>>> > > > technical debt. >>>> > > > >>>> > > > I would welcome the opportunity to clean up this mess. I'm >>>> probably as >>>> > > > qualified as anyone to make the changes. This is one of my areas >>>> of >>>> > > > expertise: I was designing a full abstract file system interface >>>> (with >>>> > > > pure-Java from-scratch implementations for the local file system, >>>> > > > Subversion, and WebDAV; even the WebDAV HTTP implementation was >>>> from >>>> > > > scratch) around the time Apache Nutch was getting off the ground. >>>> Most >>>> > > > recently I've worked on the Hadoop `FileSystem` API contracting >>>> for >>>> > > > LinkedIn, discovering (what I consider to be) a huge bug in >>>> > > > ViewFilesystem, HADOOP-18525 >>>> > > > <https://issues.apache.org/jira/browse/HADOOP-18525>. >>>> > > > >>>> > > > The cleanup should be done in several stages (e.g. consolidating >>>> > > > WinUtils access; replacing code with pure Java API calls; >>>> undeprecating >>>> > > > the new Stat code and relegating it to a different class, etc.). >>>> > > > Unfortunately it's not financially feasible for me to sit here for >>>> > > > several months and revamp the Hadoop `FileSystem` subsystem for >>>> fun >>>> > > > (even though I wish I could). Perhaps there is a job opening at a >>>> company >>>> > > > related to Hadoop that would be interested in hiring me and >>>> devoting a >>>> > > > certain percentage of my time to fixing local `FileSystem` >>>> access. If >>>> > > > so, let me know where I should send my resume >>>> > > > <https://www.garretwilson.com/about/resume>. >>>> > > > >>>> > > > Otherwise let me know if you have any ideas for a way forward. 
If there >>>> proves >>>> > to >>>> > > > be interest in GlobalMentor Hadoop Bare Naked Local FileSystem >>>> > > > <https://github.com/globalmentor/hadoop-bare-naked-local-fs> on >>>> GitHub >>>> > > > I'll try to maintain and improve it, but really what needs to be >>>> > > > revamped is the Hadoop codebase itself. I'll be happy when Hadoop >>>> is >>>> > > > fixed so that both Steve's code and my code are no longer needed. >>>> > > > >>>> > > > Garret >>>> > > > >>>> > > >>>> > >>>> >>>