Hey Mohamed,

Great to hear from you again, and sorry for taking a bit to reply. Excellent prototype; it's exactly what I had in mind. Very nice work. I checked it out, ran it, and it worked perfectly. Using Jackson for JSON handling makes perfect sense; it's also used heavily inside AsterixDB for a variety of things.

Those next steps sound great. You might find this document useful for the low-level details of the data model: https://cwiki.apache.org/confluence/display/ASTERIXDB/AsterixDB+Object+Serialization+Reference
ADM is basically an extension of JSON. Even that document is a bit out of date; we also support a 'geometry' type, which is a GeoJSON field. Some kind of integration test also sounds good. It's always a bit tricky coordinating things between two projects that depend on one another, so feel free to reach out with any questions or difficulties you come across.
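As a tiny illustration of that data-model wrinkle: when query results come back as plain JSON, ADM-specific types (datetime, point, geometry, ...) lose their type information, so a dump tool has to re-wrap them in constructor syntax when it regenerates statements. A rough sketch in Python (the field names and serialized forms here are made-up assumptions; in a real tool the mapping would come from the Datatype metadata, and the exact SQL++ grammar should be double-checked):

```python
import json

# Hypothetical mapping: which fields of a dataset carry ADM types that
# plain JSON cannot represent directly. In a real tool this would be
# derived from the Datatype metadata, not hard-coded.
ADM_CONSTRUCTORS = {
    "created": "datetime",  # ADM datetime, serialized as an ISO string in JSON
}

def to_sqlpp_literal(field, value):
    """Render one JSON value as a SQL++ literal, re-wrapping ADM types
    in their constructor functions (e.g. datetime("...")) where needed."""
    if field in ADM_CONSTRUCTORS:
        return f'{ADM_CONSTRUCTORS[field]}({json.dumps(value)})'
    return json.dumps(value)

def record_to_insert(dataverse, dataset, record):
    """Turn one record (a dict parsed from the query result) into an
    INSERT statement. Treat the syntax as a sketch, not the exact grammar."""
    fields = ", ".join(f'"{k}": {to_sqlpp_literal(k, v)}' for k, v in record.items())
    return f"INSERT INTO {dataverse}.{dataset} ([{{{fields}}}]);"

rec = {"id": 1, "name": "Ann", "created": "2026-03-27T06:34:00Z"}
print(record_to_insert("Default", "Users", rec))
# -> INSERT INTO Default.Users ([{"id": 1, "name": "Ann", "created": datetime("2026-03-27T06:34:00Z")}]);
```

The serialization reference linked above is the place to work out, per ADM type, what the JSON form looks like and what constructor to wrap it back in.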
To answer your two questions:

1. Sure, you can send the proposal draft to me and I'd be happy to give it a look.
2. I would defer to the GSoC guide on this, but I think it means 350 hours over the 12 weeks of the program. It doesn't have to be 8 hours every day, but it should average out to about that much.

I put this project down as "large" because even though it's conceptually straightforward, there is a lot of surface area that is important to get right. I think the code in the area of metadata management is also kind of convoluted and hard to read, and I didn't want there to be time pressure when there are a lot of sticky details. If it ends up being a bit easier than it seems, there are some adaptations or extensions to the project that could easily fill the rest of the time. For example, if the translator ends up working perfectly as an external tool, the next step could be integrating it into the main codebase as a datasource or a special function that returns everything. There are also many variants in how to represent the backups of datasets. The most straightforward, conceptually, is to emit an INSERT for each record. However, INSERT statements have poor performance in AsterixDB for a variety of reasons, so one improvement could be to dump each dataset as a JSONL file and then have the DDLs load it with COPY FROM or LOAD statements instead.

Best,
- Ian

On Fri, Mar 27, 2026 at 6:34 AM Mohamed Hossam <[email protected]> wrote:
>
> Hey everyone,
>
> I'm Mohamed Hossam, a recent CS graduate who's interested in database
> systems. I'm currently writing a proposal for the "Backup/restore utility
> for AsterixDB [ASTERIXDB-3697
> <https://issues.apache.org/jira/browse/ASTERIXDB-3697>]" project for GSoC
> 2026. I emailed the potential mentor for this project a couple of weeks
> ago, and he instructed me to run AsterixDB locally and investigate the code
> in asterixdb-metadata. So, I did as suggested by my mentor and I'm excited
> to share what I learnt.
>
> I managed to create a very minimal prototype that can query AsterixDB and
> generate basic CREATE and INSERT statements from its data. You can find my
> work at: [m0hossam/asterixdb-dump
> <https://github.com/m0hossam/asterixdb-dump/>]. I used FasterXML's Jackson
> JSON parser and tried to recreate the metadata objects from the parsed
> JSON. Of course, this is only a proof of concept, I'm deliberately ignoring
> complex statements just to get the prototype up and running to test
> the feasibility of the project. The actual implementation will require a
> deeper understanding of AsterixDB's metadata and catalog.
>
> My next steps would be:
>
> - Dive deeper into asterixdb-metadata and gain a better understanding of
> the data model.
> - Potentially contribute to AsterixDB if I find something to improve in
> the relevant code areas.
> - Write automated unit tests to compare queries from the original
> database with queries from the database generated by my prototype's dump,
> ensuring database integrity.
>
>
> I have two questions regarding the project:
>
> 1. Should I send my technical proposal to the potential mentor of the
> project for review before submitting it through the official website?
> 2. The project size is supposedly "~350 hour (large)". What does this
> mean in terms of time commitment? Will the project have an extended
> timeline? Or does the project require 8 hours of work per day during the 3
> months of coding?
>
>
> Best regards,
> Mohamed Hossam
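P.S. To make the JSONL variant from my reply a bit more concrete, here is a rough sketch (Python for brevity; the dataverse/dataset names are made up, and the localfs adapter and LOAD parameter spellings should be checked against the docs for your AsterixDB version):

```python
import json
import os
import tempfile

def dump_jsonl(records, path):
    """Write one JSON object per line (JSONL) -- the bulk format
    suggested instead of per-record INSERTs."""
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def load_statement(dataverse, dataset, path):
    """Emit a LOAD statement pointing at the dump file. The localfs
    adapter and parameter names here are assumptions to be verified
    against the current LOAD / COPY FROM documentation."""
    return (
        f'LOAD DATASET {dataverse}.{dataset} USING localfs\n'
        f'    (("path"="localhost://{path}"), ("format"="json"));'
    )

records = [{"id": 1, "name": "Ann"}, {"id": 2, "name": "Bob"}]
path = os.path.join(tempfile.gettempdir(), "users.jsonl")
dump_jsonl(records, path)
print(load_statement("Default", "Users", path))
```

The nice property of this shape is that the backup stays human-readable JSON, while the restore path goes through the bulk-load machinery rather than one INSERT per record.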
