Update: I'm investigating the possibility that I've reached the overcommit limit in the kernel as a result of all the parallel processes.
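One way to check this hypothesis is to read the kernel's overcommit settings and commit accounting from /proc. This is a hedged sketch (the helper name `overcommit_info` is my own, and the /proc paths are Linux-specific); it returns an empty dict on systems without /proc rather than failing:

```python
# Sketch: inspect Linux overcommit settings and commit accounting.
# overcommit_info() is a hypothetical helper name; paths are Linux-specific.
import os


def overcommit_info(proc_root="/proc"):
    """Return overcommit-related kernel settings, or {} if /proc is absent."""
    info = {}
    paths = {
        "overcommit_memory": os.path.join(proc_root, "sys/vm/overcommit_memory"),
        "overcommit_ratio": os.path.join(proc_root, "sys/vm/overcommit_ratio"),
    }
    for name, path in paths.items():
        try:
            with open(path) as f:
                info[name] = int(f.read().strip())
        except OSError:
            pass
    # CommitLimit vs. Committed_AS shows how close the system is to the limit.
    try:
        with open(os.path.join(proc_root, "meminfo")) as f:
            for line in f:
                if line.startswith(("CommitLimit", "Committed_AS")):
                    key, value = line.split(":", 1)
                    info[key] = value.strip()
    except OSError:
        pass
    return info


if __name__ == "__main__":
    print(overcommit_info())
```

If `Committed_AS` is near `CommitLimit` while the pool workers are running, the overcommit theory would be consistent with the observed halt.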
This still doesn't fix the client.release() problem, but it might explain why the processing appears to halt after some time until I restart the Jupyter kernel.

On Tue, Jul 10, 2018 at 12:27 PM Corey Nolet <cjno...@gmail.com> wrote:

> Wes,
>
> Unfortunately, my code is on a separate network. I'll try to explain what
> I'm doing and if you need further detail, I can certainly pseudocode
> specifics.
>
> I am using multiprocessing.Pool() to fire up a bunch of threads for
> different filenames. In each thread, I'm performing a pd.read_csv(),
> sorting by the timestamp field (rounded to the day), and chunking the
> DataFrame into separate DataFrames. I create a new Plasma ObjectID for each
> of the chunked DataFrames, convert them to RecordBatch objects, stream the
> bytes to Plasma, and seal the objects. Only the ObjectIDs are returned to
> the orchestration thread.
>
> In follow-on processing, I'm combining the ObjectIDs for each of the
> unique day timestamps into lists and passing those into a function in
> parallel using multiprocessing.Pool(). In this function, I'm iterating
> through the lists of ObjectIDs, loading them back into DataFrames,
> appending them together until their size is greater than some predefined
> threshold, and performing a df.to_parquet().
>
> The steps in the two paragraphs above are performed in a loop, batching up
> 500-1k files at a time for each iteration.
>
> When I run this iteration a few times, it eventually locks up the Plasma
> client. With regards to the release() fault, it doesn't seem to matter when
> or where I run it (in the orchestration thread or in other threads); it
> always seems to crash the Jupyter kernel. I'm thinking I might be using it
> wrong; I'm just trying to figure out where and what I'm doing.
>
> Thanks again!
>
> On Tue, Jul 10, 2018 at 12:05 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
>> hi Corey,
>>
>> Can you provide the code (or a simplified version thereof) that shows
>> how you're using Plasma?
>>
>> - Wes
>>
>> On Tue, Jul 10, 2018 at 11:45 AM, Corey Nolet <cjno...@gmail.com> wrote:
>> > I'm on a system with 12TB of memory and attempting to use PyArrow's
>> > Plasma client to convert a series of CSV files (via Pandas) into a
>> > Parquet store.
>> >
>> > I've got a little over 20k CSV files to process, which are about 1-2GB
>> > each. I'm loading 500 to 1000 files at a time.
>> >
>> > In each iteration, I'm loading a series of files, partitioning them by a
>> > time field into separate DataFrames, then writing Parquet files in
>> > directories for each day.
>> >
>> > The problem I'm having is that the Plasma client & server appear to lock
>> > up after about 2-3 iterations. It locks up to the point where I can't
>> > even CTRL+C the server. I am able to stop the notebook, but re-trying
>> > the code just continues to lock up when interacting with Jupyter. There
>> > are no errors in my logs to tell me something's wrong.
>> >
>> > Just to make sure I'm not just being impatient and possibly need to wait
>> > for some background services to finish, I allowed the code to run
>> > overnight and it was still in the same state when I came in to work this
>> > morning. I'm running the Plasma server with 4TB max.
>> >
>> > In an attempt to proactively free up some of the object IDs that I no
>> > longer need, I also attempted to use the client.release() function, but
>> > I cannot seem to figure out how to make it work properly. It crashes my
>> > Jupyter kernel each time I try.
>> >
>> > I'm using PyArrow 0.9.0.
>> >
>> > Thanks in advance.